Digital Signal Processing 10, 1–18 (2000)
doi:10.1006/dspr.1999.0355, available online at http://www.idealibrary.com on
The NIST 1999 Speaker Recognition
Evaluation—An Overview
Alvin Martin and Mark Przybocki
National Institute of Standards and Technology, 100 Bureau Drive, Stop 8940,
Gaithersburg, Maryland 20899-8940
E-mail: alvin.martin@nist.gov, mark.przybocki@nist.gov
Martin, Alvin, and Przybocki, Mark, The NIST 1999 Speaker Recognition Evaluation—An Overview, Digital Signal Processing 10 (2000), 1–18.
This article summarizes the 1999 NIST Speaker Recognition Evaluation.
It discusses the overall research objectives, the three task definitions, the
development and evaluation data sets, the specified performance measures
and their manner of presentation, and the overall quality of the results. More
than a dozen sites from the United States, Europe, and Asia participated
in this evaluation. There were three primary tasks for which automatic
systems could be designed: one-speaker detection, two-speaker detection,
and speaker tracking. All three tasks were performed in the context of mu-law encoded conversational telephone speech. The one-speaker detection
task used single channel data, while the other two tasks used summed
two-channel data. About 500 target speakers were specified, with 2 min of
training speech data provided for each. Both multiple and single speaker
test segments were selected from about 2000 conversations that were
not used for training material. The duration of the multiple speaker test
data was nominally 1 min, while the duration of the single speaker test
segments varied from near zero up to 60 s. For each task, systems had to
make independent decisions for selected combinations of a test segment
and a hypothesized target speaker. The data sets for each task were
designed to be large enough to provide statistically meaningful results
on test subsets of interest. Results were analyzed with respect to various
conditions including duration, pitch differences, and handset types. © 2000
Academic Press
Key Words: speaker recognition; speaker verification; speaker detection;
speaker tracking; DET curve; NIST evaluation.
1. INTRODUCTION
The 1999 NIST Speaker Recognition Evaluation was the latest in an ongoing
series of yearly evaluations conducted by NIST. These evaluations aim to
provide an important contribution to the direction of research efforts and the
calibration of technical capabilities. They are intended to be of interest to
all researchers working on the general problem of text independent speaker
recognition. To this end the evaluations are designed to be simple, to focus on
core technology issues, to be fully supported, and to be accessible.
This evaluation focused on speaker detection and tracking tasks in the context
of conversational telephone speech. The evaluation was designed to foster
research progress, with the objectives of:
1. exploring promising new ideas in speaker recognition,
2. developing advanced technology incorporating these ideas, and
3. measuring the performance of this technology.
There were 16 participating sites in the 1999 evaluation, some working in
collaborative partnerships. Included were four sites from the United States, one
from India, and eleven from Europe. Six of the European sites or partnerships
shared some resources and algorithms in a cooperative arrangement known as the ELISA Consortium [9]. Figure 1 summarizes the organizations that participated and the tasks for which they submitted system results.
FIG. 1. Participants. Listed are the 16 sites that participated in the 1999 NIST Speaker Recognition Evaluation. Represented are four sites from the United States, one from India, and eleven from Europe. Six groups submitted results for the one-speaker detection task only. One group submitted results for both the one-speaker detection task and the speaker tracking task, while five groups submitted results for all three tasks. There were two collaborative partnerships across sites and six of the European groups (identified by •) shared some resources and algorithms in a cooperative arrangement known as the ELISA Consortium.
Information on the NIST speaker recognition evaluations, including the official evaluation plans for past evaluations, is available on the NIST website [1].
2. THE TASKS
The 1999 evaluation consisted of three separate tasks, two of which were
new to the annual NIST speaker recognition evaluations. Each task consisted
of several thousand trials. A trial consisted of a single hypothesized speaker and
a specific test segment. The system was required to make an actual (true or
false) decision on whether (or in the case of the speaker tracking task, where)
the specified speaker was present in the test segment. Along with each actual
decision, systems were also required to provide for each trial a likelihood score
indicating the degree of confidence in the decision. Higher scores indicated
greater confidence in the presence of the speaker. A trial where the hypothesized
speaker was present in the test segment (correct answer “true”) is referred to
as a target trial. Other trials (correct answer “false”) are referred to as impostor
trials.
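To make the bookkeeping concrete, the sketch below shows one way trial records and the target/impostor distinction might be represented in code; the field names and the segment-to-speakers mapping are illustrative assumptions, not the evaluation's actual submission format.

```python
# A minimal sketch of trial bookkeeping, assuming illustrative field names;
# this is not the official submission format.
from dataclasses import dataclass

@dataclass
class Trial:
    hypothesized_speaker: str  # target speaker ID under test
    test_segment: str          # test segment ID
    decision: bool             # actual (true/false) decision
    score: float               # likelihood score; higher means more confident

def is_target_trial(trial: Trial, true_speakers: dict) -> bool:
    """Target trial: the hypothesized speaker actually speaks in the segment."""
    return trial.hypothesized_speaker in true_speakers[trial.test_segment]
```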
2.1. One-Speaker Detection
This task is the basic speaker recognition task used in the previous NIST
evaluations. The task is to determine whether a specified speaker is speaking
in a given single channel segment of mu-law encoded telephone speech. The
hypothesized speakers are always of the same sex as the segment speaker.
2.2. Two-Speaker Detection
This task is the same as the one-speaker detection task, except that the speech
segments include both sides of a telephone call with two speakers present and
the channels summed, rather than being limited to a single side and speaker.
Note that the task is to determine whether the one specified speaker is present
in the combined signal. The segment speakers may be of the same or opposite
sex, but the hypothesized speaker is always of the same sex as at least one of the
segment speakers.
2.3. Speaker Tracking
This task is to perform speaker detection as a function of time. Systems were
required to identify the exact time intervals when the hypothesized speaker was
speaking. For each identified time interval, an actual decision and a likelihood
score were required. The task used a subset of the multiple speaker test data
that was used for the two-speaker detection task.
3. DATA
The data for the evaluation came from the Switchboard-2 Phase 3 Corpus
collected by the Linguistic Data Consortium (LDC) [2]. This corpus consists of
2728 5-min conversations involving 640 speakers. The participating speakers
were mainly college students in the southern United States and did not know
one another. They were recruited through campus flyers and local newspaper
advertisements and paid a nominal fee for their participation. Each speaker
was allowed to initiate and receive at most one call per day. Received calls were
always on the same phone line, while initiated calls were required to all be made
from distinct phone lines. Speakers were paired through an automaton that
was called by the person initiating the call. A suggested topic was provided, but
because subjects were not required to talk about the topic, college-type chitchat
dominated the conversations.
3.1. Training
Training data were provided for all hypothesized speakers of all trials. Such
speakers are referred to as target speakers. The training data for each target
speaker consisted of 1 min of speech from each of two different conversations
using the same phone number and thus, presumably, the same handset. Each
1-min segment consisted of consecutive turns of the speaker with areas of
silence removed. These data were always selected from near the end of the
conversation. (The end seemed a better, more conversational, choice than
the beginning, which might contain more formal introductions.) The actual
durations of the training segments were allowed to vary in the range of 55–65 s
so as to include only whole turns wherever possible.
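A rough sketch of this selection logic is given below, assuming the speaker's turns are available as (start, end) pairs in seconds with silence already removed; the function name and representation are hypothetical, not the tools actually used to build the corpus segments.

```python
# Illustrative sketch: take whole turns from the end of the conversation
# until roughly a minute of the target speaker's speech is accumulated.
# The (start, end) turn representation is an assumption for this example.
def select_training_turns(turns, minimum_seconds=55.0):
    """turns: the speaker's turns in time order as (start, end) in seconds."""
    selected, total = [], 0.0
    for start, end in reversed(turns):
        if total >= minimum_seconds:
            break
        selected.insert(0, (start, end))
        total += end - start
    return selected, total
```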
Training data were created for each speaker in the corpus for which there
were two available conversations using the same phone number. This resulted
in 230 male target speakers and 309 female target speakers.
3.2. One-Speaker Detection
The one-speaker detection task used single speaker test segments. These test
segments consisted of essentially the same data as in the two-speaker detection
task described below, but with each of the two channels in these segments
viewed as comprising a separate segment. As with the training segments, areas
of silence were removed in the one-speaker test segments and whole turns
included to the extent possible. Thus, the length of these segments varied from
close to zero up to a full minute. No more than one test segment was created
from each conversation side, and no test segments came from conversations
where training data were selected from either side. There were 3420 onespeaker detection segments.
For each segment, 11 target speakers were specified for judgment as
hypothesized speakers. One of these speakers was the actual speaker, provided
that the actual speaker was a target speaker as was the case for more than
90% of the test segments. The other ten were randomly selected from among all
target speakers of the same sex as the true speaker. Thus the total number
of trials in this task was 37,620 with 3157 of them being target trials and
the remaining 34,463 being impostor trials. A target trial occurs when the
hypothesized speaker is the speaker in the test segment and an impostor trial is
when the hypothesized speaker is not.
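As an illustration of how such a trial list could be assembled (the helper below is hypothetical, not NIST's actual selection code), each segment's 11 hypothesized speakers are the true speaker, when a target speaker, plus randomly drawn same-sex target speakers.

```python
# Hypothetical sketch of assembling the 11 hypothesized speakers for a
# one-speaker test segment, following the description above.
import random

def hypothesized_speakers(true_speaker, same_sex_targets, n=11):
    """same_sex_targets: target speaker IDs of the same sex as the true speaker."""
    include_true = true_speaker in same_sex_targets
    pool = [s for s in same_sex_targets if s != true_speaker]
    chosen = random.sample(pool, n - 1 if include_true else n)
    if include_true:
        chosen.append(true_speaker)
    return chosen
```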
3.3. Two-Speaker Detection
The two-speaker detection task used multiple speaker test segments. These
test segments consisted of summed two-sided intervals with durations of
59–61 s selected from near the end of conversations, with whole turns included
whenever possible. The duty cycle of target speakers varied from close to
zero up to 100%. No more than one test segment was created from each
conversation, and no test segments came from conversations where training
data were selected from either side. There were 1723 two-speaker detection
segments.
For each segment, 22 target speakers were specified for judgment as
hypothesized speakers. These consisted of the two sets of 11 speakers used in
corresponding single-sided segments in the one-speaker task. Note that two of
the 22 were usually actual speakers in the conversation and that the 22 speakers
were either all of the same sex or consisted of 11 speakers of each sex, depending
on the sexes of the two actual speakers of the test segment. The total number
of trials in this task was 37,906 with 3158 of them being target trials and the
remaining 34,748 being impostor trials.
3.4. Speaker Tracking
The test segments consisted of a subset of 1000 of the test segments for the
two-speaker detection task. Segments were chosen to include as many target
speakers as possible and in fact 507 target speakers were represented in these
segments. For each segment, the hypothesized speakers consisted of four of
the 22 used in the two-speaker detection task. These included the two true
speakers if they were target speakers. If the two true speakers were of the same
sex, all four hypothesized targets were of this sex; otherwise there were two
hypothesized targets of each sex.
4. PERFORMANCE MEASURE
Evaluation was performed separately for the three tasks. For each task the
formal evaluation measure was a detection cost function, denoted CDet , defined
as a weighted sum of the miss and false alarm error probabilities as determined
from a system’s actual decisions:
$$C_{\mathrm{Det}} = C_{\mathrm{Miss}} \cdot P_{\mathrm{Miss|Target}} \cdot P_{\mathrm{Target}} + C_{\mathrm{FalseAlarm}} \cdot P_{\mathrm{FalseAlarm|NonTarget}} \cdot P_{\mathrm{NonTarget}}. \quad (1)$$
The required parameters in this function are the cost of a missed detection
(CMiss ), the cost of a false alarm (CFalseAlarm ), the a priori probability of a
target speaker (PTarget ), and the a priori probability of a non-target speaker
(PNonTarget ). In the 1999 evaluation the following parameter values were used:
$$C_{\mathrm{Miss}} = 10; \quad C_{\mathrm{FalseAlarm}} = 1; \quad P_{\mathrm{Target}} = 0.01; \quad P_{\mathrm{NonTarget}} = 1 - P_{\mathrm{Target}} = 0.99. \quad (2)$$
The relatively greater cost of a miss compared to a false alarm is probably
realistic for many applications. The a priori probability of a target speaker was
more arbitrary and for the tracking task was probably not a realistic choice.
Note that this specified a priori probability need not, and in fact did not,
correspond to the actual percentage of target instances in the evaluation data.
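A minimal sketch of this computation, using the 1999 parameter values and hard decisions represented as booleans (variable names are illustrative), might look as follows.

```python
# Detection cost of Eq. (1) with the 1999 parameters of Eq. (2), computed
# from hard decisions. Variable names are illustrative.
C_MISS, C_FALSE_ALARM = 10.0, 1.0
P_TARGET = 0.01
P_NONTARGET = 1.0 - P_TARGET

def detection_cost(decisions, is_target):
    """decisions, is_target: parallel lists of booleans, one pair per trial."""
    target = [d for d, t in zip(decisions, is_target) if t]
    impostor = [d for d, t in zip(decisions, is_target) if not t]
    p_miss = sum(1 for d in target if not d) / len(target)
    p_false_alarm = sum(1 for d in impostor if d) / len(impostor)
    return C_MISS * p_miss * P_TARGET + C_FALSE_ALARM * p_false_alarm * P_NONTARGET
```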
For the detection tasks the decisions are discrete, and PMiss and PFalseAlarm
are defined by the counts of correct and incorrect decisions. For the tracking
task, a reference answer key was determined by applying an energy-based
automatic speech detector to each conversation side. PMiss and PFalseAlarm were
then computed as follows:
$$P_{\mathrm{Miss}} = \frac{\int_{\mathrm{TargetSpeech}} \delta(D_t, F)\,dt}{\int_{\mathrm{TargetSpeech}} dt}, \qquad P_{\mathrm{FalseAlarm}} = \frac{\int_{\mathrm{ImpostorSpeech}} \delta(D_t, T)\,dt}{\int_{\mathrm{ImpostorSpeech}} dt}, \quad (3)$$
where $D_t$ is the system output as a function of time, and
$$\delta(x, y) = \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{otherwise.} \end{cases}$$
Systems were free to specify tracking decisions and scores down to the
centisecond level. This was perhaps a greater level of granularity than needed
and resulted in some very large submission files.
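The time-weighted error rates of Eq. (3) can be sketched as below, assuming the reference speech regions and the system's "speaker present" decisions are available as lists of (start, end) intervals in seconds; the interval representation is an assumption of this example, not the official scoring software.

```python
# Sketch of the time-weighted miss and false alarm rates of Eq. (3),
# assuming (start, end) interval lists in seconds.
def overlap(a, b):
    """Total duration of overlap between two lists of (start, end) intervals."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

def tracking_errors(target_speech, impostor_speech, decided_present):
    target_dur = sum(e - s for s, e in target_speech)
    impostor_dur = sum(e - s for s, e in impostor_speech)
    p_miss = 1.0 - overlap(target_speech, decided_present) / target_dur
    p_false_alarm = overlap(impostor_speech, decided_present) / impostor_dur
    return p_miss, p_false_alarm
```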
The performance of systems on a given task can be shown by bar charts of the
CDet scores. In such bar charts (an example may be seen in Fig. 3) we show the
separate contributions of the false alarm rates and the missed detection rates
to the total scores. We also use the decision likelihood scores to find all of the
possible operating points (false alarm and missed detection rate pairs) of each
system and present these as DET (detection error tradeoff) curves, a variant of
ROC curves using a normal deviate scale for each axis. (See [3] for a discussion
of DET curves.) We note with an open circle, ◦, the point corresponding to
the actual decisions, and with a diamond, ♦, the point that would produce a
minimum CDet value. Examples appear in Section 5.
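The sketch below illustrates how DET operating points can be derived from the likelihood scores by sweeping a threshold and mapping the error rates to the normal deviate scale; it is a simplified illustration, not the plotting software used for the official results.

```python
# Illustrative DET computation: sweep a decision threshold over the scores
# and map (false alarm, miss) probabilities to the normal deviate scale.
from statistics import NormalDist

def det_points(target_scores, impostor_scores):
    probit = NormalDist().inv_cdf
    points = []
    for threshold in sorted(set(target_scores + impostor_scores)):
        p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
        p_fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        if 0.0 < p_miss < 1.0 and 0.0 < p_fa < 1.0:
            points.append((probit(p_fa), probit(p_miss)))
    return points
```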
5. RESULTS
For each of the three tasks a subset of the full evaluation test set corresponding to the test conditions of greatest interest as specified in the evaluation plan
was defined. Figure 2 gives the details of these primary conditions and the numbers of target and impostor trials and speakers they encompassed.
FIG. 2. Primary conditions. The primary condition for each of the three evaluation tasks was defined to include the trials reflecting the test conditions of greatest interest for research on the core problems of speaker recognition as specified in the evaluation plan. Trials that satisfied these conditions formed the basic set on which NIST reported the official results.
The primary
conditions included that the handsets used in both training and test have electret microphones and that different phone lines, and thus presumably different
handsets, be used for training and test in the target trials. (See Sections 6.3 and
6.4 below for a discussion of handset differences and microphone types.) Note
that phone lines are always different for impostor trials. The speaker durations
were required to be neither especially long nor especially short (in the 15–45 s
range). For the two-speaker detection and speaker tracking trials the primary
conditions had the additional restrictions that both handset microphones used
in the test be electret and that the two speakers be of the same sex.
Figure 3 shows a bar graph of the detection cost function scores (CDet ) for
the actual decisions by each system in the one-speaker detection task, using
the primary condition trials. Clearly, there is a great deal of variation in
system performance for this task. In accord with our understanding with the
participants, we are not identifying the various systems in this plot or the DET
plots presented hereafter, but for the identity of the best performing system for
each task see the discussion below.
Figure 4 presents the corresponding DET curves for the one-speaker detection task. Since the DET plots generally contain more information of interest in
terms of performance results and tradeoff, we hereafter concentrate on these.
Figures 5 and 6 plot DET curves of the various systems’ performance for the
two-speaker detection and the speaker tracking tasks, respectively, on the primary condition trials.
FIG. 3. Actual decision costs for one-speaker detection. This bar graph shows the contributions of each error type to the total detection cost for each system. The trials used were those satisfying the one-speaker detection primary condition.
FIG. 4. DET curves for one-speaker detection. This plot shows DET curves, encompassing the full range of operating points, for each system. For each curve, the actual decision operating point is marked with a diamond, while the minimum detection cost operating point is marked with an open circle.
FIG. 5. DET curves for two-speaker detection. This plot shows DET curves for the five participating systems on the primary condition trials.
FIG. 6. DET curves for speaker tracking. This plot shows DET curves for the six participating systems on the primary condition trials. The ranges included on each axis have been expanded from those in the other plots because of the generally low performance level on this task.
FIG. 7. One-speaker vs two-speaker detection. Performance is shown for each of two systems on the detection tasks using the same underlying primary condition data. Both the actual two-speaker performance and that predicted from the one-speaker performance are shown for each system.
It may be observed from Figs. 4 and 5 that the two-speaker detection task proved to be considerably harder than the one-speaker task. To some degree this is to be expected. A likely approach to the two-speaker detection task is to seek to separate the speech of the two speakers (this is relevant to the
tracking task as well) and then apply one-speaker detection to each separated
set of segments. The hypothesized speaker is declared present in the combined
segment if and only if it is declared present in either of the two separated
segment sets. Such an approach was used by at least some automatic systems [8]
and also by a human attempting the task. If a one-speaker detection operating
point has error probabilities (p1FA, p1M) then, based on these probabilities
and an independence assumption, one might predict for the two-speaker task
the operating point (p2FA, p2M) where
$$p_{2\mathrm{FA}} = 1 - (1 - p_{1\mathrm{FA}})^2 = 2\,p_{1\mathrm{FA}} - p_{1\mathrm{FA}}^2,$$
$$p_{2\mathrm{M}} = p_{1\mathrm{M}} \cdot (1 - p_{1\mathrm{FA}}) = p_{1\mathrm{M}} - p_{1\mathrm{M}} \cdot p_{1\mathrm{FA}}. \quad (4)$$
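This prediction is simple to compute; the sketch below applies Eq. (4) to a one-speaker operating point (the example numbers are illustrative, not evaluation results).

```python
# Predicted two-speaker operating point from a one-speaker operating point,
# per Eq. (4), under the independence assumption described above.
def predict_two_speaker(p1_fa, p1_miss):
    p2_fa = 1.0 - (1.0 - p1_fa) ** 2    # false alarm on either separated channel
    p2_miss = p1_miss * (1.0 - p1_fa)   # miss only if the other channel does not false alarm
    return p2_fa, p2_miss

# Illustrative example: a 1% false alarm, 10% miss one-speaker point predicts
# roughly a 2% false alarm, 9.9% miss two-speaker point.
print(predict_two_speaker(0.01, 0.10))
```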
These ideas are explored in Fig. 7. For two evaluation systems, we present DET
curves which show, for the same underlying speech segments (the two-speaker
primary condition segments), actual performance results for the one-speaker
task, predicted performance results based on the above approach for the twospeaker task, and actual performance results for the two-speaker task. We see
that this approach predicts part, but not the greater part, of the performance
degradation in going from the one-speaker task to the two-speaker task, and
other factors must also be involved. Some further information on the nature of the increased difficulty of the two-speaker detection task may be inferred from the discussion of duration in Section 6.
FIG. 8. Official winners. Shown are the plaques that were presented to the system developers of the official winning systems for each of the three evaluation tasks. These winners were the systems with the lowest actual decision detection cost for each task on primary condition trials.
Figure 6 shows that speaker tracking is a very difficult task for current
systems. Performance varied little across the systems participating and appears
to be not far in excess of random. For this reason the upper limit on the
probability scales in the figure is higher than that in the preceding figures. While
part of the difficulty is caused by inexactness in the automatically determined
speech boundaries, analysis shows this not to be a major factor in the poor
performance. Further research on how to approach this task, and how best to
score performance on it, is clearly appropriate.
While performance differences among the best performing systems were not
especially large and the evaluation was not intended primarily as a competition,
it was desired to give formal recognition to the systems that did best on each
task as determined by the official CDet actual decision metric. Financing was
not available this year for cash awards for the researchers involved. Instead,
handsome plaques, as shown in Fig. 8, were created and presented to the lead
investigators of the best performing system for the primary condition for each
of the three tasks. These winners were:
• Dragon Systems, Inc., Fred Weber, one-speaker detection task,
• MIT–Lincoln Laboratory, Douglas Reynolds, Bob Dunn, and Tom
Quatieri, two-speaker detection task,
• MIT–Lincoln Laboratory, Douglas Reynolds, Bob Dunn, and Tom
Quatieri, speaker tracking task.
6. FACTORS AFFECTING PERFORMANCE
Having access to the raw submission results of all participants gives NIST
the opportunity to analyze these results in order to explore various factors
that may affect recognition performance. Over the years NIST has explored
the significance of numerous data, speaker, and channel attributes on overall
performance. Here we discuss several of these that appear to have been of importance in the 1999 evaluation.
FIG. 9. Effect of segment duration for one-speaker detection. Performance is shown by duration ranges for segments satisfying the one-speaker primary conditions other than segment duration. It is seen that performance is considerably poorer for segments shorter than 15 s, but that for longer segments duration does not affect performance.
6.1. Duration
Previous NIST evaluations included separate tests in one-speaker detection
for segments of differing duration, namely 30, 10, and 3 s. These showed, as
might be expected, that performance was significantly greater for 30-s segments
than for 10-s segments, while performance on 10-s segments significantly
exceeded that on 3-s segments. This year the one-speaker segments averaged
30 s in duration but varied over a continuous range up to 1 min.
Figure 9 displays DET curves of one-speaker detection performance by ranges
of segment duration for one system. The results shown are representative of
those for all of the systems in the evaluation. They indicate that, at least for
systems of the type used in this evaluation, performance is significantly lower
for segments of shorter than 15 s in duration, but that for segments longer than
15 s duration does not greatly affect the performance level. This is consistent
with the finding of previous years, but indicates that the duration effect seen
then is limited and that once some minimum duration, apparently in the 10–15 s
range, is available, the amount of test speech ceases to be a major factor in performance.
FIG. 10. Effect of segment duration for two-speaker detection. Performance by duration ranges is shown, as in Fig. 9, for two-speaker detection primary condition segments. Here duration matters until the target speaker's speech duration exceeds 35 s.
Figure 10 presents a somewhat similar plot for the two-speaker detection task
for one system. For this task all segments have a duration of about 1 min.
The individual plots therefore each include all impostor tests but are limited
to target tests where the durations of speech by the true speaker are in the
specified ranges. Here it is seen that duration does matter for performance, at
least up to the two highest ranges presented, involving target speaker durations
in excess of 35 s. This held for all participating systems. For the two highest
duration ranges the performance differences were small and not consistent in
direction among the several participating systems. This suggests that target
speaker duration matters at least up to the point where it is significantly in
excess of the speech duration of the other speaker in the test segment.
6.2. Pitch
Pitch would appear to be an important factor in speaker recognition,
but attempts to specifically include it in algorithms have had only limited
success [4]. NIST has investigated ways in which average speaker pitch affects
performance in each of the past several years. Results have not been consistent
on how performance is affected by limiting consideration within each sex to
speakers of particularly high or low pitch [5, 6]. We have found, as might be expected, that limiting impostor trials to instances where the impostor's average pitch is close to that of the hypothesized target (in the training data), while including all target trials, degrades performance. But, perhaps surprisingly, a bigger effect is observed when target trials are restricted to those with the largest pitch differences between the training and test segments, while all impostor trials are included.
FIG. 11. Effect of target pitch differences. Shown for one system is the primary condition performance and performance when target trials are restricted to those with the 25% closest and farthest average pitch differences between the training and test data. The effect of such differences on performance is quite large.
Figure 11 gives an example of a typical system in the 1999 evaluation. For
each speaker, the average pitch of the training data and of each test segment
was estimated. The plot shows for the one-speaker detection task, a curve of all
primary condition tests, and curves limited to target trials where the log pitch
difference between the test segment and the target speaker training data is in
the high and low 25% of all such differences. Large pitch differences in target
trials may correspond to instances where the speaker had a cold or was feeling
particularly emotional during either the training or test conversation. Note how
large the performance differences are. For example, at a 10% miss rate, the false
alarm rate is around 4% when all trials are included. When target trials are limited to the 25% that are closest in pitch, the false alarm rate is less than 1%;
when limited to the 25% farthest in pitch, the false alarm rate exceeds 10%.
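The breakdown described above can be sketched as follows, assuming per-trial average pitch estimates in hertz for the training data and the test segment are available; the field names are illustrative, not NIST's analysis code.

```python
# Hedged sketch of splitting target trials by the magnitude of the log
# average-pitch difference between training and test; field names are
# illustrative assumptions.
import math

def split_by_pitch_difference(target_trials, fraction=0.25):
    """target_trials: dicts with 'train_pitch' and 'test_pitch' in Hz."""
    keyed = sorted(
        target_trials,
        key=lambda t: abs(math.log(t["test_pitch"]) - math.log(t["train_pitch"])),
    )
    k = int(len(keyed) * fraction)
    return keyed[:k], keyed[-k:]   # (closest 25%, farthest 25%)
```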
FIG. 12. Effect of same–different lines for training–test. Performance for a typical system is far
higher when target trials are restricted to those involving the same phone line (and presumably
handset) in training as in test as opposed to those involving different lines.
6.3. Handset Differences
The variation in telephone handsets is a major factor affecting the performance of speaker recognition using telephone speech. For the Switchboard-2
Corpus, specific handset information is not provided, but telephone line information (i.e., the phone number) is. We can generally assume that if two calls are
from different lines the handsets are different; if they are from the same line
the handsets are probably the same.
We have chosen to concentrate in this evaluation on target trials where
the training and test segment lines are different. This is certainly the harder
problem. Moreover, using same-line calls is in a way unfairly easy, since for
impostor trials the training and test segment handsets are always different
as, with rare exceptions, speakers do not share handsets. Thus, using same
line target trials could be viewed as handset recognition rather than speaker
recognition.
Figure 12 shows, for one system, the large performance difference in this
evaluation between using same line and different line target trials. The impostor
trials are the same in both cases.
FIG. 13. Handset type distribution in corpus. Received calls generally involved electret
handsets, but initiated calls are divided between handset types. Presumably this is because the
initiators often used pay phones. Females apparently used fewer carbon button handsets than males.
This may indicate a bias in the handset type determination algorithm.
6.4. Handset Type
Most standard telephone handset microphones are of either the carbon-button or electret type. We have seen in recent evaluations that the handset types
(i.e., the microphone types) used, both in the training and in the test segments,
can greatly influence recognition performance.
Handset-type information for this evaluation was determined using the MIT–
Lincoln Lab automatic handset labeler [7]. This is a software package that uses
the input telephone speech signal from one channel to assign a likelihood that
the signal is from a carbon-button handset as opposed to an electret handset.
This likelihood was converted into a hard decision (carbon or electret). This
hard decision was made available to the systems processing the training and test
segments. The decisions made were certainly less than perfect, as occasionally
conversations from the same phone number were assigned opposite type, but
are believed to be generally accurate.
Figure 13 provides some information on the distribution of handset types in
the conversations used in the evaluation. The main point to note is that the
received calls generally from home phones overwhelmingly involve electrettype handsets, while the initiated calls often made from public phones are
split between electret and carbon-button type. There is also evidence that
conversation sides involving female speakers are more likely to be declared
to be of electret type. This may indicate a slight bias in the automatic typedetection algorithm. It may also help to explain the slightly better overall
performance of most systems on male speakers compared to that on female
speakers.
FIG. 14. Performance as a function of training–test handset. Performance for one system
on different-number tests for each combination of training and test handset types. Performance
improves when the types match and is superior for electret handsets.
Figure 14 shows the variation in performance for different combinations of
training and test segment handset types for a typical system. All target trials
here are different line. There is a considerable advantage to having matching
handset types and particularly to having electret-type handset microphones in
both the training and the test segments.
7. FUTURE PLANS
The 2000 NIST Speaker Recognition Evaluation is being planned. It is expected to reuse selected data from the various parts of the existing Switchboard corpora. It may also use a Switchboard-type corpus of cellular telephone
data now being collected. It was suggested at the last workshop that some non-English data be included. We are investigating the availability of suitable data.
We expect the three tasks of the 1999 evaluation to reappear in the 2000
evaluation. Some consideration is being given to variations of the speaker
tracking task.
To learn more about future evaluation plans or to arrange to participate in
the next evaluation, please visit the NIST website [1] or contact the first author.
REFERENCES
1. http://www.nist.gov/speech/spkrinfo.htm
2. Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Suite 200,
Philadelphia, PA 19104-2608. http://morph.ldc.upenn.edu.
3. Martin, A., et al., The DET curve in assessment of detection task performance. In Proc. EuroSpeech, Vol. 4, 1997, pp. 1895–1898.
4. Doddington, G., et al., The NIST Speaker Recognition Evaluation—Overview, methodology, systems, results, perspective, Speech Commun., in press.
5. Przybocki, M., and Martin, A., NIST Speaker Recognition Evaluation—1997. In RLA2C,
Avignon, April 1998, pp. 120–123.
6. Przybocki, M., and Martin, A., RLA2C presentation, Avignon, April 1998, NIST Speaker
Recognition Evaluations: Review of the 1997 & 1998 Evaluations, http://www.nist.gov/speech/
rla2c_pres/index.htm.
7. Quatieri, T., Reynolds, D., and O’Leary, G., Magnitude-only estimation of handset nonlinearity
with application to speaker recognition. In Proceedings ICASSP, 1998, pp. 745–748.
8. Dunn, R. B., Reynolds, D., and Quatieri, T. F., Approaches to speaker detection and tracking in
conversational speech, Digital Signal Process. 10 (2000), 93–112.
9. The ELISA Consortium, The ELISA systems for the NIST’99 evaluation in speaker detection
and tracking, Digital Signal Process. 10 (2000), 143–153.
ALVIN MARTIN is coordinator of the annual speaker recognition evaluations at NIST. He has a
Ph.D. in mathematics from Yale University. Contact him at alvin.martin@nist.gov.
MARK PRZYBOCKI is a computer scientist at NIST. He has an M.S. in computer science from
Hood College. Contact him at mark.przybocki@nist.gov.