The NIST Speaker Recognition Evaluations
Alvin F Martin
alvinfmartin@gmail.com
Odyssey 2012 @ Singapore
27 June 2012

Outline
• Some Early History
• Evaluation Organization
• Performance Factors
• Metrics
• Progress
• Future

Some Early History
• Success of speech recognition evaluation
  – Showed benefits of independent evaluation on common data sets
• Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard
  – Multi-purpose corpus collected (~1991) with speaker recognition in mind
  – Followed by Switchboard-2 and similar collections
• Linguistic Data Consortium created in 1992 to support further speech (and text) collections in US
• The first “Odyssey” – Martigny 1994, followed by Avignon, Crete, Toledo, etc.
• Earlier NIST speaker evaluations in ’92, ’95
  – ’92 evaluation had several sites as part of DARPA program
  – ’95 evaluation with 6 sites used some Switchboard-1 data
  – Emphasis was on speaker id rather than open set recognition

Martigny 1994
Varying corpora and performance measures made meaningful comparisons difficult

Avignon 1998
19th February 1998: WORKSHOP RLA2C - Speaker Recognition
RLA2C: la Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques / Speaker Recognition and its Commercial and Forensic Applications
AVIGNON 20-23 avril/april 1998
Soutenu / Sponsored by GFCP - SFA - ESCA - IEEE
TIMIT was preferred corpus
Sometimes bitter debate over forensic capabilities

Avignon Papers

Crete 2001
First official “Odyssey”
More emphasis on evaluation
2001: A Speaker Odyssey - The Speaker Recognition Workshop
June 18-22, 2001, Crete, Greece

Toledo 2004
ISCA Archive
ODYSSEY 2004 - The Speaker and Language Recognition Workshop
May 31 - June 3, 2004, Toledo, Spain
First Odyssey with NIST SRE Workshop held in conjunction at same location.
First to include language recognition.
Two notable keynotes on forensic recognition.
Well attended.
Odyssey held biennially since 2004.

Etc. – Odyssey 2006, 2008, 2010, 2012, …
Odyssey 2008: The Speaker and Language Recognition Workshop, Stellenbosch, South Africa, January 21-24, 2008
Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, Czech Republic, 28 June - 1 July 2010

Organizing Evaluations
• Which task(s)?
• Key principles
• Milestones
• Participants

Which Speaker Recognition Problem?
• Access Control?
  – Text independent or dependent?
  – Prior probability of target high
• Forensic?
  – Prior not clear
• Person Spotting?
  – Prior probability of target low
  – Text independent
• NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of performance curve

Some Basic Evaluation Principles
• Speaker spotting primary task
• Research system oriented
• Pooling across target speakers
• Emphasis on low false alarm rate operating point with scores and decisions (calibration matters)

Organization Basics
• Open to all willing participants
• Research-oriented
  – Commercialized competition discouraged
• Written evaluation plans
  – Specified rules of participation
• Workshops limited to participants
  – Each site/team must be represented
• Evaluation data sets subsequently published by the LDC

1996 Evaluation Plan (cont’d)
1. PROC plots are ROCs plotted on normal probability error (miss versus false alarm) plots

DET Curve Paper – Eurospeech ’97

Wikipedia DET Page

Some Milestones
• 1992 – DARPA program limited speaker identification evaluation
• 1995 – Small identification evaluation
• 1996 – First SRE in current series
• 2000 – AHUMADA Spanish data, first non-English speech
• 2001 – Cellular data
• 2001 – ASR transcripts provided
• 2002 – FBI “forensic” database
• 2002 – SuperSid Workshop following SRE
• 2005 – Multiple languages with bilingual speakers

Some Milestones (cont’d)
• 2005 – Room mic recordings, cross-channel trials
• 2008 – Interview data
• 2010 – New decision cost function metric stressing even lower FA rate region
• 2010 – High and low vocal effort, aging
• 2010 – HASR (Human-Assisted Speaker Recognition) Evaluation
• 2011 – BEST Evaluation, broad range of test conditions, included added noise and reverb
• 2012 – Target Speakers Defined Beforehand

Participation
• Grew from fewer than a dozen to 58 sites in 2010
• MIT (Doug) provided workshop notebook covers listing participants
• Big increase in participants after 2001
• Handling scores of participating sites becomes a management problem

NIST 2004 Speaker Recognition Workshop
Taller de Reconocimiento de Locutor

Participating Sites
[Bar chart: number of sites (vertical axis, 0 to 70) by evaluation year (horizontal axis: 92*, 95*, 96, 97, 98, 99, 00, 01, 02, 03, 04, 05, 06, 08, 10, 11*, 12#). * Not in SRE series. # Incomplete.]

This slide is from 2001: A Speaker Odyssey in Crete

NIST Evaluation Data Set (cont’d)
Year | Common Condition(s) | Evaluation Features
2002 | One-session training on conv. phone data | Cellular data, alternative tests of extended training, speaker segmentation, and a limited corpus of simulated forensic data
2003 | One-session training on conv. phone data | Cellular data, extended training
2004 | Handheld landline conv. phone speech, English only | Multi-language data with bilingual speakers
2005 | English only with handheld tel. set | Included cross-channel trials with mic. test, both sides of 2-channel convs. provided
2006 | English only trials (including mic. test trials) | Included cross-channel trials with mic. test

NIST Evaluation Data Set (cont’d)
Year | Common Condition(s) | Evaluation Features
2008 | 8 – contrasting English and bilingual speakers, interview and conv. phone speech along with cross-condition trials | Interview speech recorded over multiple mic channels and conv. phone speech recorded over mic and tel channels, multiple languages
2010 | 9 – contrasting tel and mic channels, interview and conversational phone speech, and high, low and normal vocal effort | Multiple microphones, phone calls with high, low, and normal vocal effort, aging data (Greybeard), HASR
2012 | 5 – interview test without noise, conv. phone test without noise, interview test with added noise, conv. phone test with added noise, conv. phone test collected in noisy environment | Target speakers specified in advance (from previous evals) with large amounts of training, some test calls collected in noisy environments, phone test data with added noise

Performance Factors
• Intrinsic
• Extrinsic
• Parametric

Intrinsic Factors
Relate to the speaker
  – Demographic factors
    • Sex
    • Age
    • Education
  – Mean pitch
  – Speaking style
    • Conversational telephone
    • Interview
    • Read text
  – Vocal effort
    • Some questions about definition and how to collect
  – Aging
    • Hard to collect sizable amounts of data with years of time separation

Extrinsic Factors
Relate to the collection environment
  – Microphone or telephone channel
  – Telephone channel type
    • Landline, cellular, VOIP
    • In earlier times, carbon vs. electret
  – Telephone handset type
    • Handheld, headset, earbud, speakerphone
  – Microphone type – matched, mismatched
  – Placement of microphone relative to speaker
  – Background noise
  – Room reverberation

“Parametric” Factors
• Train/test speech duration
  – Have tested 10 s up to ~half hour, more in ’12
• Number of training sessions
  – Have tested 1 to 8, more in ’12
• Language
  – English has been predominant, but a variety of others included in some evaluations
  – Is better performance for English due to familiarity and quantity of development data?
  – Cross-language trials a separate challenge

Metrics
• Equal Error Rate
  – Easy to understand
  – Not operating point of interest
  – Calibration matters
• Decision Cost Function
• CLLR
• FA rate at fixed miss rate
  – E.g. 10% (lower for some conditions)

Decision Cost Function CDet
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 - PTarget)
• Weighted sum of miss and false alarm error probabilities
• Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget
• Normalize by best possible cost of system doing no processing (minimum of cost of always deciding “yes” or always deciding “no”)

Decision Cost Function CDet (cont’d)
• Parameters 1996-2008: CMiss = 10, CFalseAlarm = 1, PTarget = 0.01
• Parameters 2010: CMiss = 1, CFalseAlarm = 1, PTarget = 0.001
• Change in 2010 (for core and extended tests) met with some skepticism, but outcome appeared satisfactory

CLLR
\[ C_{llr} = \frac{1}{2 \log 2} \left( \frac{\sum \log(1 + 1/s)}{N_{TT}} + \frac{\sum \log(1 + s)}{N_{NT}} \right) \]
where first summation is over target trials, second is over non-target trials, N_{TT} and N_{NT} are the numbers of target and non-target trials, respectively, and s represents a trial’s likelihood ratio
• Information theoretic measure made popular in this community by Niko
• Covers broad range of performance operating points
• George has suggested limiting range to low FA rates

Fixed Miss Rate
• Suggested in ’96, was primary metric in BEST 2012: FA rate corresponding to 10% miss rate
• Easy to understand
• Practical for applications of interest
• May be viewed as cost of listening to false alarms
• For easier conditions, a 1% miss rate now more appropriate

Recording Progress
• Difficult to assure test set comparability
  – Participants encouraged to run prior systems on new data
• Technology changes
  – In ’96 landline phones predominated, with carbon button or electret microphones
  – Need to explore VOIP
• With progress, want to make the test harder
  – Always want to add new evaluation conditions, new bells and whistles
    • More channel types, more speaking styles, languages, etc.
  – Externally added noise and reverb explored in 2011 with BEST
• Doug’s history slide - updated

History Slide

Future
• SRE12
• Beyond

SRE12 Plans
• Target speakers specified in advance
  – Speakers in recent past evaluations (in the thousands)
  – All prior speech data available for training
  – Some new targets with training provided at evaluation time
  – Test segments will include non-target speakers
• New interview speech provided in 16-bit linear pcm
• Some test calls collected in noisy environments
• Artificial noise added to some test segment data
• Will this be an effectively easier id task?
  – Will the provided set of known targets change system approaches?
  – Optional conditions include
    • Assume test speaker is one of the known targets
    • Use no information about targets other than that of the trial

SRE12 Metric
• Log-likelihood ratios will now be required
  – Therefore, no hard decisions are asked for
• Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other a target prior an order of magnitude greater (details on next slide)
  – Adds to stability of cost measure
  – Emphasizes need for good score calibration over wide range of log likelihoods
• Alternative metrics will be Cllr and Cllr-M10, where the latter is Cllr limited to trials for which PMiss > 10%

SRE12 Primary Cost Function
• Niko noted that estimated llr’s making good decisions at a single operating point may not be effective at other operating points; therefore an average of two points is used
• Writing DCF as PMiss + β × PFA, where β = (CFA / CMiss) × (1 - PTarget) / PTarget
• We take as cost function (DCF1 + DCF2) / 2, where PTarget-1 = 0.01, PTarget-2 = 0.001, with always CMiss = CFA = 1

Future Possibilities
• SRE12 outcome will determine whether prespecified targets will be further explored
  – Does this make the problem too easy?
• Artificially added noise and reverb may continue
• HASR12 will indicate whether human-in-the-loop evaluation gains traction
• SRE’s have become bigger undertakings
  – Fifty or more participating sites
  – Data volume approaching terabytes (as in BEST)
  – Tens or hundreds of millions of trials
  – Schedule could move to every three years
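
Worked sketch of the metrics
The cost functions on the Metrics and SRE12 slides can be illustrated with a few lines of Python. This is a minimal sketch, not NIST scoring software: the function names (normalized_dcf, beta, sre12_primary_cost, cllr) are hypothetical, the choice of Python is an assumption, and the error rates passed in are assumed to have been measured already at the relevant decision thresholds. The Cllr function takes natural-log likelihood ratios, so s = exp(llr), and assumes moderate values that do not overflow math.exp.

```python
import math


def normalized_dcf(p_miss, p_fa, c_miss, c_fa, p_target):
    """C_Det divided by the cost of the better of the two trivial systems."""
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Always deciding "no" misses every target trial; always deciding "yes"
    # false-alarms on every non-target trial. Normalize by the cheaper one.
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return c_det / c_default


def beta(c_miss, c_fa, p_target):
    """Weight on P_FA when the DCF is written as P_Miss + beta * P_FA."""
    return (c_fa / c_miss) * (1.0 - p_target) / p_target


def sre12_primary_cost(p_miss_1, p_fa_1, p_miss_2, p_fa_2):
    """Average of two DCFs with P_Target = 0.01 and 0.001, C_Miss = C_FA = 1.

    Each (p_miss, p_fa) pair is assumed to be measured at the threshold
    that goes with its own target prior.
    """
    dcf1 = p_miss_1 + beta(1.0, 1.0, 0.01) * p_fa_1
    dcf2 = p_miss_2 + beta(1.0, 1.0, 0.001) * p_fa_2
    return (dcf1 + dcf2) / 2.0


def cllr(target_llrs, nontarget_llrs):
    """Cllr from natural-log likelihood ratios of target and non-target trials."""
    # With s = exp(llr): log(1 + 1/s) = log1p(exp(-llr)), log(1 + s) = log1p(exp(llr)).
    target_term = sum(math.log1p(math.exp(-x)) for x in target_llrs) / len(target_llrs)
    nontarget_term = sum(math.log1p(math.exp(x)) for x in nontarget_llrs) / len(nontarget_llrs)
    return (target_term + nontarget_term) / (2.0 * math.log(2.0))
```

With CMiss = CFA = 1 the normalized cost and the PMiss + β × PFA form coincide, which is why the sketch reuses beta for the SRE12 average. A system that outputs a log-likelihood ratio of 0 for every trial carries no information, and cllr returns 1 for it.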