Speaker Recognition Evaluations at NIST: 1996 -

The NIST Speaker Recognition Evaluations
Alvin F Martin
alvinfmartin@gmail.com
Odyssey 2012 @ Singapore
27 June 2012
Outline
• Some Early History
• Evaluation Organization
• Performance Factors
• Metrics
• Progress
• Future
Some Early History
• Success of speech recognition evaluation
  – Showed benefits of independent evaluation on common data sets
• Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard
  – Multi-purpose corpus collected (~1991) with speaker recognition in mind
  – Followed by Switchboard-2 and similar collections
• Linguistic Data Consortium created in 1992 to support further speech (and text) collections in US
• The first “Odyssey” – Martigny 1994, followed by Avignon, Crete, Toledo, etc.
• Earlier NIST speaker evaluations in ‘92, ’95
  – ‘92 evaluation had several sites as part of DARPA program
  – ‘95 evaluation with 6 sites used some Switchboard-1 data
  – Emphasis was on speaker ID rather than open-set recognition
Martigny 1994
Varying corpora and performance measures made meaningful comparisons difficult
Avignon 1998
RLA2C Workshop – “La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques” (Speaker Recognition and its Commercial and Forensic Applications)
Avignon, 20-23 April 1998
Sponsored by GFCP - SFA - ESCA - IEEE
TIMIT was the preferred corpus
Sometimes bitter debate over forensic capabilities
Avignon Papers
Crete 2001
First official “Odyssey”; more emphasis on evaluation
2001: A Speaker Odyssey – The Speaker Recognition Workshop
June 18-22, 2001, Crete, Greece
Toledo 2004
ODYSSEY 2004 – The Speaker and Language Recognition Workshop
May 31 – June 3, 2004, Toledo, Spain
First Odyssey with the NIST SRE Workshop held in conjunction at the same location
First to include language recognition
Two notable keynotes on forensic recognition
Well attended; Odyssey held biennially since 2004
Etc. – Odyssey 2006, 2008, 2010, 2012, …
Odyssey 2008: The Speaker and Language Recognition Workshop – Stellenbosch, South Africa, January 21-24, 2008
Odyssey 2010: The Speaker and Language Recognition Workshop – Brno, Czech Republic, 28 June – 1 July 2010
Organizing Evaluations
• Which task(s)?
• Key principles
• Milestones
• Participants
Which Speaker Recognition Problem?
• Access Control?
– Text independent or dependent?
– Prior probability of target high
• Forensic?
– Prior not clear
• Person Spotting?
– Prior probability of target low
– Text independent
• NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of the performance curve
Some Basic Evaluation Principles
• Speaker spotting primary task
• Research system oriented
• Pooling across target speakers
• Emphasis on low false alarm rate operating point with scores and decisions (calibration matters)
Organization Basics
• Open to all willing participants
• Research-oriented
– Commercialized competition discouraged
• Written evaluation plans
– Specified rules of participation
• Workshops limited to participants
– Each site/team must be represented
• Evaluation data sets subsequently published by the LDC
1996 Evaluation Plan (cont’d)
1996 Evaluation Plan (cont’d)
1. PROC plots are ROCs plotted on normal probability error (miss versus false alarm) plots
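As a rough sketch of what such plots involve (not part of the original evaluation plan), the snippet below maps miss and false-alarm probabilities to the normal-deviate coordinates used in DET plots; the operating points are made up for illustration.

```python
# Sketch: mapping error probabilities to the normal-deviate (probit) axes used
# for DET plots. The operating points below are illustrative, not real results.
from scipy.stats import norm

# hypothetical (false_alarm_rate, miss_rate) points along one system's curve
operating_points = [(0.20, 0.02), (0.05, 0.05), (0.01, 0.12)]

for p_fa, p_miss in operating_points:
    # norm.ppf is the inverse standard-normal CDF; if target and non-target
    # scores were Gaussian, the curve would be a straight line on these axes.
    x, y = norm.ppf(p_fa), norm.ppf(p_miss)
    print(f"P(FA)={p_fa:.2f}, P(Miss)={p_miss:.2f} -> DET coords ({x:+.2f}, {y:+.2f})")
```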
DET Curve Paper – Eurospeech ‘97
Wikipedia DET Page
Some Milestones
• 1992 – DARPA program limited speaker identification evaluation
• 1995 – Small identification evaluation
• 1996 – First SRE in current series
• 2000 – AHUMADA Spanish data, first non-English speech
• 2001 – Cellular data
• 2001 – ASR transcripts provided
• 2002 – FBI “forensic” database
• 2002 – SuperSid Workshop following SRE
• 2005 – Multiple languages with bilingual speakers
Some Milestones (cont’d)
• 2005 – Room mic recordings, cross-channel trials
• 2008 – Interview data
• 2010 – New decision cost function metric stressing even lower FA rate region
• 2010 – High and low vocal effort, aging
• 2010 – HASR (Human-Assisted Speaker Recognition) Evaluation
• 2011 – BEST Evaluation, broad range of test conditions, included added noise and reverb
• 2012 – Target speakers defined beforehand
Participation
• Grew from fewer than a dozen to 58 sites in 2010
• MIT (Doug) provided workshop notebook covers listing participants
• Big increase in participants after 2001
• Handling scores of participating sites becomes a management problem
NIST 2004 Speaker Recognition Workshop (Taller de Reconocimiento de Locutor)
Participating Sites
[Chart: number of participating sites by evaluation year – 92*, 95*, 96, 97, 98, 99, 00, 01, 02, 03, 04, 05, 06, 08, 10, 11*, 12#; vertical axis 0-70]
* Not in SRE series
# Incomplete
This slide is from 2001: A Speaker Odyssey in Crete
NIST Evaluation Data Set (cont’d)
2002
  Common Condition(s): One-session training on conv. phone data
  Evaluation Features: Cellular data, alternative tests of extended training, speaker segmentation, and a limited corpus of simulated forensic data
2003
  Common Condition(s): One-session training on conv. phone data
  Evaluation Features: Cellular data, extended training
2004
  Common Condition(s): Handheld landline conv. phone speech, English only
  Evaluation Features: Multi-language data with bilingual speakers
2005
  Common Condition(s): English only with handheld tel. set
  Evaluation Features: Included cross-channel trials with mic. test, both sides of 2-channel convs. provided
2006
  Common Condition(s): English only trials (including mic. test trials)
  Evaluation Features: Included cross-channel trials with mic. test
NIST Evaluation Data Set (cont’d)
2008
  Common Condition(s): 8 – contrasting English and bilingual speakers, interview and conv. phone speech along with cross-condition trials
  Evaluation Features: Interview speech recorded over multiple mic channels and conv. phone speech recorded over mic and tel channels, multiple languages
2010
  Common Condition(s): 9 – contrasting tel and mic channels, interview and conversational phone speech, and high, low and normal vocal effort
  Evaluation Features: Multiple microphones, phone calls with high, low, and normal vocal effort, aging data (Greybeard), HASR
2012
  Common Condition(s): 5 – interview test without noise, conv. phone test without noise, interview test with added noise, conv. phone test with added noise, conv. phone test collected in noisy environment
  Evaluation Features: Target speakers specified in advance (from previous evals) with large amounts of training, some test calls collected in noisy environments, phone test data with added noise
Performance Factors
• Intrinsic
• Extrinsic
• Parametric
Intrinsic Factors
Relate to the speaker
– Demographic factors
• Sex
• Age
• Education
– Mean pitch
– Speaking style
• Conversational telephone
• Interview
• Read text
– Vocal effort
• Some questions about definition and how to collect
– Aging
• Hard to collect sizable amounts of data with years of time separation
Extrinsic Factors
Relate to the collection environment
– Microphone or telephone channel
– Telephone channel type
• Landline, cellular, VOIP
• In earlier times, carbon vs. electret
– Telephone handset type
• Handheld, headset, earbud, speakerphone
– Microphone type – matched, mismatched
– Placement of microphone relative to speaker
– Background noise
– Room reverberation
“Parametric” Factors
• Train/test speech duration
– Have tested 10 s up to ~half hour, more in ‘12
• Number of training sessions
– Have tested 1 to 8, more in ‘12
• Language – English has been predominant, but a variety of others included in some evaluations
  – Is better performance for English due to familiarity and quantity of development data?
  – Cross-language trials a separate challenge
Metrics
• Equal Error Rate
– Easy to understand
– Not operating point of interest
– Calibration matters
• Decision Cost Function
• CLLR
• FA rate at fixed miss rate
– E.g. 10% (lower for some conditions)
Decision Cost Function CDet
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)
• Weighted sum of miss and false alarm error probabilities
• Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget
• Normalize by best possible cost of system doing no processing (minimum of cost of always deciding “yes” or always deciding “no”)
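A minimal sketch of how CDet and its normalized form might be computed from a system's hard decisions, assuming the 1996-2008 parameter values shown on the next slide; the trial lists are hypothetical.

```python
# Sketch: actual and normalized detection cost (CDet) from hard decisions.
# Parameters follow the 1996-2008 setting (C_Miss=10, C_FA=1, P_Target=0.01);
# the decision and label lists are made-up examples, not evaluation data.
C_MISS, C_FA, P_TARGET = 10.0, 1.0, 0.01

def cdet(decisions, labels):
    """decisions, labels: parallel lists of booleans (True = 'target')."""
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    p_miss = sum(1 for d, l in zip(decisions, labels) if l and not d) / n_tar
    p_fa = sum(1 for d, l in zip(decisions, labels) if d and not l) / n_non
    cost = C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1 - P_TARGET)
    # Normalize by the cheaper of the two no-processing systems:
    # always "yes" costs C_FA*(1-P_Target); always "no" costs C_Miss*P_Target.
    c_default = min(C_MISS * P_TARGET, C_FA * (1 - P_TARGET))
    return cost, cost / c_default

labels    = [True, True, False, False, False, False]   # ground truth
decisions = [True, False, False, True, False, False]   # system's yes/no output
actual, normalized = cdet(decisions, labels)
print(f"CDet = {actual:.4f}, normalized CDet = {normalized:.4f}")
```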
Decision Cost Function CDet (cont’d)
• Parameters 1996-2008: CMiss = 10, CFalseAlarm = 1, PTarget = 0.01
• Parameters 2010: CMiss = 1, CFalseAlarm = 1, PTarget = 0.001
• Change in 2010 (for core and extended tests) met with some skepticism, but outcome appeared satisfactory
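As a worked illustration (not stated on the slide) of why the 2010 parameters stress an even lower false alarm region, the effective weight on PFalseAlarm relative to PMiss in the cost function, written as β later in the SRE12 slides, changes as follows:

\[
\beta = \frac{C_{FalseAlarm}\,(1 - P_{Target})}{C_{Miss}\,P_{Target}}:
\qquad
\beta_{1996\text{-}2008} = \frac{1 \times 0.99}{10 \times 0.01} = 9.9,
\qquad
\beta_{2010} = \frac{1 \times 0.999}{1 \times 0.001} = 999.
\]

So a false alarm carries roughly one hundred times more weight, relative to a miss, under the 2010 parameters, which pushes the operating point toward much lower FA rates.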
CLLR
Cllr = 1/(2·log 2) × [ (1/NTT) Σ_target log(1 + 1/s) + (1/NNT) Σ_non-target log(1 + s) ]
where the first summation is over target trials, the second is over non-target trials, NTT and NNT are the numbers of target and non-target trials, respectively, and s represents a trial’s likelihood ratio
• Information-theoretic measure made popular in this community by Niko
• Covers broad range of performance operating points
• George has suggested limiting range to low FA rates
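A minimal sketch of the Cllr computation following the formula above; the likelihood-ratio lists are hypothetical, and using base-2 logs with a factor of 1/2 absorbs the 1/(2·log 2) scaling.

```python
# Sketch: Cllr computed from trial likelihood ratios, following the formula
# above. Base-2 logs plus a factor of 1/2 are equivalent to the 1/(2*log 2)
# scaling. The likelihood-ratio values are made-up examples.
import math

def cllr(target_lrs, nontarget_lrs):
    """target_lrs, nontarget_lrs: likelihood ratios s for target / non-target trials."""
    tar = sum(math.log2(1 + 1 / s) for s in target_lrs) / len(target_lrs)
    non = sum(math.log2(1 + s) for s in nontarget_lrs) / len(nontarget_lrs)
    return 0.5 * (tar + non)  # in bits; approaches 0 for a perfect, well-calibrated system

target_lrs    = [50.0, 8.0, 0.9]       # ideally large (favor the target)
nontarget_lrs = [0.02, 0.5, 1.3, 0.1]  # ideally small
print(f"Cllr = {cllr(target_lrs, nontarget_lrs):.3f} bits")
```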
Fixed Miss Rate
• Suggested in ‘96, was primary metric in BEST 2012: FA rate corresponding to 10% miss rate
• Easy to understand
• Practical for applications of interest
• May be viewed as cost of listening to false alarms
• For easier conditions, a 1% miss rate now more appropriate
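A minimal sketch of reading off the FA rate at a fixed 10% miss rate from raw scores; the scores are synthetic, and ties and threshold interpolation are ignored.

```python
# Sketch: false-alarm rate at a fixed (10%) miss rate, read off from raw
# scores. Scores are synthetic; ties and threshold interpolation are ignored.
import numpy as np

def fa_at_fixed_miss(target_scores, nontarget_scores, miss_rate=0.10):
    # Threshold below which the requested fraction of target trials is missed,
    # then measure the fraction of non-target trials scoring above it.
    threshold = np.quantile(target_scores, miss_rate)
    return float(np.mean(np.asarray(nontarget_scores) > threshold))

rng = np.random.default_rng(0)
target_scores    = rng.normal(2.0, 1.0, 1000)    # target trials score higher on average
nontarget_scores = rng.normal(0.0, 1.0, 10000)   # non-target trials
print(f"P(FA) at 10% miss = {fa_at_fixed_miss(target_scores, nontarget_scores):.4f}")
```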
Recording Progress
• Difficult to assure test set comparability
  – Participants encouraged to run prior systems on new data
• Technology changes
  – In ‘96 landline phones predominated, with carbon button or electret microphones
  – Need to explore VOIP
• With progress, want to make the test harder
  – Always want to add new evaluation conditions, new bells and whistles
    • More channel types, more speaking styles, languages, etc.
  – Externally added noise and reverb explored in 2011 with BEST
• Doug’s history slide – updated
History Slide
Future
• SRE12
• Beyond
SRE12 Plans
• Target speakers specified in advance
  – Speakers in recent past evaluations (in the thousands)
  – All prior speech data available for training
  – Some new targets with training provided at evaluation time
  – Test segments will include non-target speakers
• New interview speech provided in 16-bit linear PCM
• Some test calls collected in noisy environments
• Artificial noise added to some test segment data
• Will this be an effectively easier ID task?
  – Will the provided set of known targets change system approaches?
  – Optional conditions include
    • Assume test speaker is one of the known targets
    • Use no information about targets other than that of the trial
SRE12 Metric
• Log-likelihood ratios will now be required
  – Therefore, no hard decisions are asked for
• Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other a target prior an order of magnitude greater (details on next slide)
  – Adds to stability of cost measure
  – Emphasizes need for good score calibration over wide range of log likelihoods
• Alternative metrics will be Cllr and Cllr-M10, where the latter is Cllr limited to trials for which PMiss > 10%
SRE12 Primary Cost Function
• Niko noted that estimated llr’s making good decisions at a single operating point may not be effective at other operating points; therefore an average of two points is used
• Writing DCF as
  PMiss + β × PFA
  where β = (CFA/CMiss) × (1 − PTarget) / PTarget
• We take as cost function
  (DCF1 + DCF2)/2
  where PTarget-1 = 0.01 and PTarget-2 = 0.001, with CMiss = CFA = 1 in both cases
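A minimal sketch of the averaged cost computed from log-likelihood ratios, assuming each DCF is evaluated by thresholding the llr at its Bayes point log(β); the llr lists are hypothetical.

```python
# Sketch: SRE12-style primary cost as the average of two DCFs computed from
# log-likelihood ratios, assuming each DCF thresholds the llr at the Bayes
# point log(beta) for its operating point. The llr lists are made-up examples.
import math

def dcf(target_llrs, nontarget_llrs, p_target, c_miss=1.0, c_fa=1.0):
    beta = (c_fa / c_miss) * (1 - p_target) / p_target
    threshold = math.log(beta)                      # Bayes decision threshold on the llr
    p_miss = sum(1 for x in target_llrs if x <= threshold) / len(target_llrs)
    p_fa = sum(1 for x in nontarget_llrs if x > threshold) / len(nontarget_llrs)
    return p_miss + beta * p_fa                     # DCF as written above

target_llrs    = [4.2, 1.5, 7.8, -0.3]
nontarget_llrs = [-5.0, -2.1, 0.4, -8.3, -3.7]
dcf1 = dcf(target_llrs, nontarget_llrs, p_target=0.01)
dcf2 = dcf(target_llrs, nontarget_llrs, p_target=0.001)
print(f"Primary cost = {(dcf1 + dcf2) / 2:.3f}")
```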
Future Possibilities
• SRE12 outcome will determine whether pre-specified targets will be further explored
  – Does this make the problem too easy?
• Artificially added noise and reverb may continue
• HASR12 will indicate whether human-in-the-loop evaluation gains traction
• SREs have become bigger undertakings
  – Fifty or more participating sites
  – Data volume approaching terabytes (as in BEST)
  – Tens or hundreds of millions of trials
  – Schedule could move to every three years