INTERSPEECH 2010

Simple and Efficient Speaker Comparison using Approximate KL Divergence*

W. M. Campbell†, Z. N. Karam†‡
†MIT Lincoln Laboratory, Lexington, MA
‡DSPG, Research Laboratory of Electronics at MIT, Cambridge, MA

Abstract

We describe a simple, novel, and efficient system for speaker comparison with two main components. First, the system uses a new approximate KL divergence distance extending earlier GMM parameter vector SVM kernels. The approximate distance incorporates data-dependent mixture weights as well as the standard MAP-adapted GMM mean parameters. Second, the system applies a weighted nuisance projection method for channel compensation. A simple eigenvector method of training is presented. The resulting speaker comparison system is straightforward to implement and is computationally simple: only two low-rank matrix multiplies and an inner product are needed for comparison of two GMM parameter vectors. We demonstrate the approach on a NIST 2008 speaker recognition evaluation task. We provide insight into what methods, parameters, and features are critical for good performance.

Index Terms: speaker recognition

1. Introduction

Text-independent speaker comparison is the process of taking two speech utterances and providing a match score or posterior probability of match. Speaker comparison can be considered a core building block for speaker recognition systems. Standard approaches to comparison include training and testing with a classifier or building speaker utterance kernels, e.g., [1, 2].

Speaker comparison can be implemented using many different classifiers. We focus on approaches using a GMM universal background model (GMM UBM). Speaker comparison is accomplished using SVM kernel techniques [2]. In this structure, a GMM UBM is adapted per utterance and the resulting models are compared using an approximate KL divergence. This framework is simple and intuitive for speaker recognition since utterances are represented using GMM parameter vectors and speaker comparison is a simple inner product.

Significant improvements in error rates for speaker comparison can be obtained by using data-driven subspace models for channel and speaker representation. Two significant approaches are nuisance attribute projection (NAP) and joint factor analysis (JFA). NAP [2] uses a fixed orthogonal projection to remove nuisance directions from the GMM parameter vector; typically, this nuisance is modeled as session variation. JFA [3] models both the speaker and session variation with subspaces. Factors (coordinates) for the subspaces are derived using a MAP criterion with a prior on the factors.

Combining comparison methods with subspace methods was studied extensively in the inner product discriminant function (IPDF) framework [4].

*This work was sponsored by the Federal Bureau of Investigation under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
The IPDF work considered numerous combinations of classifiers and compensation methods and found two key aspects of good performance. First, classification methods incorporating both speaker-dependent mean and mixture weight parameters gave significant improvement over mean-only systems. Second, subspace channel compensation provided the bulk of system performance improvements.

In this paper, we present an approximate KL divergence kernel combined with weighted NAP (WNAP) [5] that implements the key insights from the IPDF framework. Our strategy is to focus on an easy-to-implement system that is efficient and achieves state-of-the-art performance. An added bonus is that the resulting method is an SVM kernel and can be used in future work for other speaker recognition tasks.

We first cover the top-level speaker comparison framework in Section 2, then present an approximate KL divergence method in Section 3. Section 4 discusses WNAP and the corresponding training criterion. Section 5 presents algorithms for the speaker comparison method. Finally, experiments in Section 6 demonstrate the effectiveness of the method and provide insight into key methods for achieving good performance.

2. GMM Parameter Vectors

A standard distribution used for text-independent speaker recognition is the Gaussian mixture model [6],

    g(x) = \sum_{i=1}^{N} \lambda_i \, \mathcal{N}(x \mid m_i, \Sigma_i).    (1)

Feature vectors are typically cepstral coefficients with associated smoothed first- and second-order derivatives. A sequence of feature vectors, X = (x_1, ..., x_{N_x}), from a speaker is mapped to a GMM by adapting a GMM universal background model (UBM). We assume only the mixture weights, λ_i, and means, m_i, in (1) are adapted. Adaptation of the means is performed with standard relevance MAP [6]; the mixture weights are estimated with the standard ML estimate. The adaptation yields new parameters which we stack into a parameter vector,

    p_x = [\lambda_x^t \;\; m_x^t]^t,    (2)

where

    \lambda_x = [\lambda_{x,1} \cdots \lambda_{x,N}]^t, \qquad m_x = [m_{x,1}^t \cdots m_{x,N}^t]^t.    (3)

Speaker comparison is the process of comparing two sequences of feature vectors, X and Y. Rather than compare these directly, we compare the corresponding parameter vectors, p_x and p_y, obtained from separately adapting the GMM UBM to X and Y. The goal is to provide a comparison function C(p_x, p_y) that produces a value reflecting the similarity of the speakers represented by the two parameter vectors.
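For illustration, the following NumPy sketch (not part of the original system description) maps a sequence of feature vectors to a parameter vector p_x using relevance MAP for the means and the ML estimate for the weights. The function name, the array layout, the diagonal-covariance assumption, and the default relevance factor are ours; the paper reports a relevance factor of 0.01 for its C_GM system.

```python
import numpy as np

def adapt_parameter_vector(frames, ubm_weights, ubm_means, ubm_covars, r=0.01):
    """Map an utterance (frames: T x n) to a GMM parameter vector p_x.

    Means are relevance-MAP adapted from the UBM; mixture weights use the
    standard ML estimate. Covariances are diagonal (N x n). A sketch only,
    not the authors' implementation.
    """
    T, n = frames.shape

    # Frame-level mixture posteriors under the diagonal-covariance UBM.
    diff = frames[:, None, :] - ubm_means[None, :, :]                  # T x N x n
    log_gauss = -0.5 * np.sum(diff**2 / ubm_covars[None], axis=2)
    log_gauss -= 0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(ubm_covars), axis=1))
    log_post = np.log(ubm_weights)[None, :] + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                            # T x N

    # Zeroth- and first-order sufficient statistics.
    n_i = post.sum(axis=0)                                             # occupancy counts
    Ex = post.T @ frames / np.maximum(n_i, 1e-10)[:, None]             # per-mixture means

    # Relevance MAP for means, ML estimate for weights (Section 2).
    alpha = n_i / (n_i + r)
    means = alpha[:, None] * Ex + (1.0 - alpha)[:, None] * ubm_means
    weights = n_i / T

    return np.concatenate([weights, means.ravel()])                    # p_x = [lambda^t m^t]^t
```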
3. Approximate KL Divergence

An obvious strategy for comparing the GMM parameter vectors is to use the KL divergence between the distributions,

    D(g_x \| g_y) = \int_{\mathbb{R}^n} g_x(x) \log \frac{g_x(x)}{g_y(x)} \, dx.    (4)

Using the KL divergence directly is difficult because it cannot be computed in closed form. Therefore, an approximate KL divergence has been used successfully in speaker recognition [2]. An approximation based on the log-sum inequality is applied to (4) to split out individual mixtures and obtain

    D(g_x \| g_y) \le D(\lambda_x \| \lambda_y) + \sum_{i=1}^{N} \lambda_{x,i} \, D\big(\mathcal{N}(\cdot \mid m_{x,i}, \Sigma_i) \,\|\, \mathcal{N}(\cdot \mid m_{y,i}, \Sigma_i)\big),    (5)

where the \Sigma_i are from the UBM. Note that we drop the term D(λ_x || λ_y) in what follows, since we are finding an upper bound and the KL divergence is always greater than zero. By symmetrizing (5) and substituting in the KL divergence between two Gaussian distributions, we obtain a distance which upper bounds the symmetric KL divergence,

    d_s(p_x, p_y) = \sum_{i=1}^{N} (0.5\lambda_{x,i} + 0.5\lambda_{y,i}) (m_{x,i} - m_{y,i})^t \Sigma_i^{-1} (m_{x,i} - m_{y,i}).    (6)

A corresponding inner product to this distance is

    C_{KL}(p_x, p_y) = \sum_{i=1}^{N} (0.5\lambda_{x,i} + 0.5\lambda_{y,i}) \, m_{x,i}^t \Sigma_i^{-1} m_{y,i}.    (7)

Note that (7) can also be expressed more compactly as

    C_{KL}(p_x, p_y) = m_x^t \big((0.5\lambda_x + 0.5\lambda_y) \otimes I_n\big) \Sigma^{-1} m_y,    (8)

where Σ is the block diagonal matrix with the Σ_i from the UBM on the diagonal, n is the feature vector dimension, and ⊗ is the Kronecker product. Note that shifting the means by the UBM will not affect the distance in (6), so we can replace the means in (8) by the UBM-centered means.

The comparison function C_KL does not correspond to an inner product in the Mercer sense; that is, we cannot separate C_KL into an inner product of the form b(p_x)^t b(p_y), where b(·) is some mapping function. A simple solution to this problem is to replace the arithmetic mean between the mixture weights in (8) with a geometric mean; we obtain

    C_{GM}(p_x, p_y) = m_x^t (\lambda_x^{1/2} \otimes I_n) \Sigma^{-1} (\lambda_y^{1/2} \otimes I_n) m_y,    (9)

where Σ is the block diagonal matrix of the UBM covariances. In experiments, we have found (9) to be a good approximation of (8). We mention that the corresponding SVM expansion to the kernel (9) is

    b(p_x) = (\lambda_x^{1/2} \otimes I_n) \Sigma^{-1/2} m_x.    (10)
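As an illustration, a short NumPy sketch of (9) and (10) follows. It assumes diagonal UBM covariances and the parameter-vector layout used in the earlier sketch; the identifiers are ours, not the paper's. The kernel value is obtained by expanding each utterance with b(·) and taking an inner product.

```python
import numpy as np

def gm_expansion(weights, means, ubm_covars):
    """SVM expansion b(p) of (10) for one utterance.

    weights: (N,) adapted mixture weights; means: (N, n) adapted means
    (optionally UBM-centered); ubm_covars: (N, n) diagonal UBM covariances.
    Returns a vector of length N*n. Illustrative sketch only.
    """
    # (lambda^{1/2} kron I_n) Sigma^{-1/2} m, computed per mixture.
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(ubm_covars)
    return scaled.ravel()

def c_gm(px, py, ubm_covars, n_mix, feat_dim):
    """Approximate KL comparison C_GM(p_x, p_y) of (9) as an inner product."""
    def split(p):
        return p[:n_mix], p[n_mix:].reshape(n_mix, feat_dim)
    wx, mx = split(px)
    wy, my = split(py)
    return gm_expansion(wx, mx, ubm_covars) @ gm_expansion(wy, my, ubm_covars)
```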
4. WNAP

Before defining WNAP, we introduce some notation. We define an orthogonal projection with respect to a metric, P_{U,D}, where D and U are full rank matrices, as

    P_{U,D} = U (U^t D^2 U)^{-1} U^t D^2,    (11)

where the columns of U form a linearly independent set, and the metric is

    \| x - y \|_D = \| Dx - Dy \|_2.    (12)

The process of projection, e.g., y = P_{U,D} b, is equivalent to solving the least-squares problem

    \hat{x} = \arg\min_x \| DUx - Db \|_2    (13)

and setting y = U\hat{x}. In practice, the projection reduces to matrix multiplies by an orthonormalized version of the subspace. The complementary projection, I - P_{U,D}, projects onto the subspace orthogonal to U with respect to the metric; its use is to reduce nuisances present in the expansion proposed in Section 2. The main assumption is that the nuisance is confined to a "small"-dimensional subspace of the expansion space. For WNAP, we use a general form of this projection.

For the WNAP training set, we assume that for every speaker (in general, every class) we can estimate a "low noise" vector z̄ from which deltas can be calculated. In practice, this smoothed vector is formed by adapting a model from the data pooled across multiple utterances from the same speaker. We then base our criterion on approximating these deltas.

More specifically, suppose we have a training set, {z_{s,i}}, labeled by speaker, s, and instance, i. For each s, we have a smoothed vector, z̄_s. For WNAP training, we use the following optimization problem,

    \min_U \sum_s \sum_i W_{s,i} \, \big\| P_{U,D_{s,i}} d_{s,i} - d_{s,i} \big\|_{D_{s,i}}^2,    (14)

where d_{s,i} = z_{s,i} - z̄_s. The WNAP training criterion (14) incorporates the goals of using a variable metric and an utterance-dependent weighting, W_{s,i}; see [5]. The training criterion attempts to find a subspace U that best approximates the nuisance d, as in prior work [2]. For the purposes of this work, we assume that D_{s,i} = D is a constant. Prior work has shown that this is a good compromise between performance and computational efficiency [5].

In the case of constant D, the WNAP criterion can be shown to be equivalent to the following problem. First, we incorporate the W_i into the d_i by letting

    \hat{d}_i = \sqrt{W_i} \, D d_i,    (15)

and we work with the transformed subspace \hat{U} = DU, constrained so that \hat{U}^t \hat{U} = I. Second, we form

    \hat{R} = \sum_{i=1}^{N} \hat{d}_i \hat{d}_i^t.    (16)

Then, the criterion (14) can be expressed as

    \max_{\hat{U}} \; \mathrm{tr}\big[\hat{U}^t \hat{R} \hat{U}\big].    (17)

In the equation, U is the desired nuisance subspace, \hat{U} = DU, and \hat{R} = DRD, where R = \sum_i W_i d_i d_i^t. This problem can be solved using an eigenvector method that will be presented in Section 5. Intuitively, the problem (17) finds a low-rank approximation, U, that best approximates the nuisance subspace.
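Before turning to the algorithms, the following minimal NumPy sketch (our own illustration, assuming a diagonal metric D for simplicity) forms the projection P_{U,D} of (11) and checks its least-squares characterization (13). Compensation then amounts to subtracting the projected nuisance component, q = b - P b, which is how the projection is used in Algorithm 2.

```python
import numpy as np

def metric_projection(U, d):
    """P_{U,D} of (11) for a diagonal metric D = diag(d).

    U: (M, k) full column rank nuisance basis; d: (M,) positive diagonal of D.
    Returns the M x M projection matrix (for illustration; in practice it is
    applied as two low-rank matrix multiplies rather than formed explicitly).
    """
    D2U = (d**2)[:, None] * U                      # D^2 U
    G = U.T @ D2U                                  # U^t D^2 U
    return U @ np.linalg.solve(G, D2U.T)           # U (U^t D^2 U)^{-1} U^t D^2

# Consistency check with (13): P_{U,D} b equals U x_hat, where x_hat solves
# min_x || D U x - D b ||_2.
rng = np.random.default_rng(0)
M, k = 40, 5
U = rng.standard_normal((M, k))
d = rng.uniform(0.5, 2.0, size=M)
b = rng.standard_normal(M)

P = metric_projection(U, d)
x_hat, *_ = np.linalg.lstsq(d[:, None] * U, d * b, rcond=None)
assert np.allclose(P @ b, U @ x_hat)
```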
5. Algorithms

Our method for speaker comparison can be split into two components: training the nuisance subspace and performing speaker comparison scoring. Both algorithms are straightforward to implement with matrix tools such as Matlab.

Training the nuisance subspace is shown in Algorithm 1. For the training set, we first note that only the utterance MAP-adapted means for the vectors m_i are used. A typical data set for training the nuisance subspace would have several sessions per speaker, typically 8 or more. A second comment on Algorithm 1 is that the metric, D, used for training the subspace is not utterance dependent; in practice, this has not impacted performance. Third, we mention that one good choice for W_i is the number of speech frames detected by speech activity detection. Fourth, we mention that in the algorithm, kernel PCA can be used as an alternative to the direct parameter expansion [7].

Algorithm 1  WNAP subspace training algorithm for a fixed metric, D = (\lambda_{UBM}^{1/2} \otimes I_n) \Sigma^{-1/2}
Input: Mean parameter vectors {m_i}, weights {W_i} with speaker labels {l_i}, and the desired corank
Output: Nuisance subspace, U
  for all unique speakers s in {l_i} do
      Find m̄_s, the mean of the m_j for all j in {j | l_j = s}
      for all j in {j | l_j = s} do
          Let d_j = m_j - m̄_s
      end for
  end for
  R = 0
  for i = 1 to N do
      R = R + W_i d_i d_i^t
  end for
  R̂ = D R D
  Û = eigs(R̂, corank)   % eigs produces the eigenvectors of the largest magnitude eigenvalues
  U = D^{-1} Û

In Algorithm 2, we show compensation using WNAP and speaker comparison scoring using C_GM. Note that the matrix D used for compensation is the same as in Algorithm 1. We also mention that the relevance factor for MAP adaptation can be tuned; typically, we use a relevance factor of 0.01.

Algorithm 2  Compensation and scoring with the C_GM kernel
Input: Two sequences of feature vectors, X_1 and X_2
Output: Comparison score, s
  for i = 1 to 2 do
      m_i = mean parameters of the MAP-adapted UBM for X_i
      D_i = (\lambda_i^{1/2} \otimes I_n) \Sigma^{-1/2}
      m_i ← m_i - P_{U,D} m_i   % remove the nuisance component
  end for
  s = m_1^t D_1 D_2 m_2
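A compact NumPy sketch of Algorithms 1 and 2 follows, exploiting the diagonal structure of D so the compensation reduces to two low-rank matrix multiplies. The identifiers and array layouts are ours, and the sketch forms R̂ explicitly, which is only practical for modest dimensions; for realistic Nn one would use an incremental or kernel PCA style solver, as noted above.

```python
import numpy as np

def train_wnap_subspace(M, W, labels, d_diag, corank):
    """Algorithm 1 (sketch): WNAP nuisance subspace for a fixed diagonal metric.

    M: (num_utts, Nn) stacked mean parameter vectors; W: (num_utts,) weights,
    e.g. speech-frame counts; labels: speaker label per utterance;
    d_diag: (Nn,) diagonal of D = (lambda_UBM^{1/2} kron I_n) Sigma^{-1/2}.
    Returns U with `corank` columns.
    """
    labels = np.asarray(labels)
    # Per-speaker deltas d_j = m_j - mean of that speaker's utterances.
    deltas = np.empty_like(M)
    for s in np.unique(labels):
        idx = np.flatnonzero(labels == s)
        deltas[idx] = M[idx] - M[idx].mean(axis=0)

    # R = sum_i W_i d_i d_i^t, then R_hat = D R D.
    R = (deltas * W[:, None]).T @ deltas
    R_hat = d_diag[:, None] * R * d_diag[None, :]

    # Eigenvectors of the largest eigenvalues, then map back: U = D^{-1} U_hat.
    eigvals, eigvecs = np.linalg.eigh(R_hat)
    U_hat = eigvecs[:, np.argsort(eigvals)[::-1][:corank]]
    return U_hat / d_diag[:, None]

def compare(m1, lam1, m2, lam2, U, d_diag, ubm_covars_diag):
    """Algorithm 2 (sketch): WNAP compensation and C_GM scoring."""
    def compensate(m):
        # m <- m - P_{U,D} m; with U from training, U^t D^2 U = I,
        # so P_{U,D} m = U (U^t D^2 m).
        return m - U @ (U.T @ (d_diag**2 * m))
    def metric(lam, n):
        # Diagonal of D_i = (lambda_i^{1/2} kron I_n) Sigma^{-1/2}.
        return np.repeat(np.sqrt(lam), n) / np.sqrt(ubm_covars_diag)
    n = m1.size // lam1.size
    m1c, m2c = compensate(m1), compensate(m2)
    return m1c @ (metric(lam1, n) * metric(lam2, n) * m2c)   # s = m1^t D1 D2 m2
```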
6. Experiments

6.1. Setup

Experiments were performed on the NIST 2008 speaker recognition evaluation (SRE) data set. Enrollment/verification methodology and the evaluation criteria, equal error rate (EER) and minDCF, were based on the NIST SRE evaluation plan [8]. The main focus of our effort was the one conversation enroll, one conversation verification task for telephone channel speech. T-Norm models and Z-Norm speech utterances were drawn from the NIST 2004 SRE corpus. Results were obtained both for English-only trials (Eng, pool 7) and for all trials (All, pool 6), which include speakers that enroll/verify in different languages.

Feature extraction was performed using HTK [9] with 20 MFCC coefficients, deltas, and acceleration coefficients for a total of 60 features. Speech activity detection (SAD) was performed using a cascade of two systems. First, a GMM speech/non-speech detector was applied. Then, these SAD marks were post-processed with an energy-based detector. Features from non-speech frames were eliminated, and then feature warping [10] was applied to all of the resulting features with a 3 second window.

A GMM UBM with 512 mixture components was trained using data from NIST SRE 2004 and from Switchboard corpora. A nuisance subspace was trained using the speakers from Switchboard 2 and NIST 2004 SRE corpora using Algorithm 1. The dimension of the nuisance subspace, U, was fixed at 64.

A few aspects of the front end were critical for the best performance. First, the full bandwidth MFCC analysis, 0-4 kHz, performed the best; in our experiments, we found that WNAP could take advantage of the additional bandwidth for speaker comparison. Second, our cascaded SAD is fairly aggressive; we found that low-level speech was not helpful in discrimination and could contain cross-talk. Finally, feature warping was a slight gain over feature 0-1 mean and variance normalization.

For our SVM system, we used both the mean-only KL kernel, K_KL, described in [2] and the new C_GM kernel from (9). The SVM background was constructed from Fisher data. T-Norm and Z-Norm were performed in the same manner as for the C_GM system. A relevance factor of 4 was used for the K_KL kernel to match prior work and for best performance. For the C_GM kernel, a relevance factor of 0.01 was used in both the kernel-only and SVM experiments.

6.2. Results

The first two lines of Table 1 show baseline systems and their compute time from benchmarks. Computation is not shown for fusion systems (MIT LL and MFCC+LPCC) because we are focusing on single system performance. The next three lines of the table contrast the new C_GM kernel with the K_KL kernel from earlier work. In the table, we see that the new C_GM kernel (shown with ZT-Norm) outperforms the K_KL kernel in all tasks. Note that C_GM incorporates the utterance mixture weights as a way of discounting uncertain mixture components in the inner product; thus, C_GM allows a lower relevance factor to be used. The prior kernel, K_KL, is more sensitive to a "noisy" model with uncertain mixture components and requires a higher relevance factor.

Table 1: A comparison of different systems on the NIST SRE 2008, one conversation telephone train and test subset (pools 6 and 7). Compute time is normalized to a JFA baseline and includes compensation and inner product only. Best performing systems are shown in bold for reference.

    System                        EER All (%)   minDCF All (x100)   EER Eng (%)   minDCF Eng (x100)   Compute time
    BUT MFCC 20 System [11]       5.71          2.95                2.85          1.40                1.00
    MIT LL Fused System [12]      7.00          3.60                3.30          1.60                -
    K_KL, rf = 0.01               7.22          3.39                4.40          2.04                0.08
    K_KL, rf = 4                  6.11          3.04                3.34          1.68                0.08
    C_GM                          5.86          2.89                3.09          1.57                0.08
    SVM K_KL                      6.51          3.01                3.51          1.65                0.86
    SVM C_GM                      6.31          2.95                3.64          1.57                0.86
    C_GM w/o ZT-Norm              6.83          3.58                3.75          1.98                0.08
    C_GM Z-Norm only              6.39          3.12                3.33          1.64                0.08
    C_GM T-Norm only              6.72          3.31                3.63          1.70                0.08
    C_GM w/o Calibration          6.70          3.68                3.09          1.57                0.08
    C_GM LPCC                     6.18          2.85                2.99          1.59                0.08
    Fuse LPCC+MFCC                5.27          2.58                2.91          1.36                -

The next set of experiments uses the kernels in an SVM configuration. The results demonstrate that SVM training does not improve the performance of the speaker comparison system. This result might be expected, since having only one enrollment utterance for the target speaker gives SVM training little additional information.

We also performed simple calibration and linear fusion experiments with our system. First, we experimented with different score normalizations: Z-Norm, T-Norm, etc. Second, we demonstrated the effect of not using English/non-English calibration (w/o Calibration) [11]. Third, we fused the MFCC system score linearly with an LPCC-based system. The LPCC front end was a minor change of the HTK configuration: 18 cepstral coefficients with energy were used along with deltas and acceleration for a total of 57 features. The resulting system also performed well and fused with our base MFCC system to achieve substantially better performance. Fusion of multiple feature types with the same system has improved performance in many systems [11, 13].
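Several rows of Table 1 isolate the effect of score normalization. For reference, the following generic sketch (standard Z-Norm/T-Norm, not code from the paper) shows the computations: Z-Norm standardizes a score using impostor utterances scored against the enrolled model, T-Norm uses a cohort of impostor models scored against the test utterance, and ZT-Norm composes the two.

```python
import numpy as np

def z_norm(raw_score, impostor_utt_scores):
    """Z-Norm: normalize with statistics of the enrolled model scored
    against a set of impostor utterances (generic sketch)."""
    mu, sigma = np.mean(impostor_utt_scores), np.std(impostor_utt_scores)
    return (raw_score - mu) / sigma

def t_norm(raw_score, cohort_model_scores):
    """T-Norm: normalize with statistics of a cohort of impostor models
    scored against the same test utterance (generic sketch)."""
    mu, sigma = np.mean(cohort_model_scores), np.std(cohort_model_scores)
    return (raw_score - mu) / sigma

def zt_norm(raw_score, impostor_utt_scores, znormed_cohort_scores):
    """ZT-Norm: Z-Norm the raw score, then T-Norm it against cohort scores
    that have themselves been Z-normalized (each with its own cohort model's
    impostor statistics). Conventions vary across systems."""
    return t_norm(z_norm(raw_score, impostor_utt_scores),
                  np.asarray(znormed_cohort_scores))
```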
6.3. Analysis

We note that our best performing single system, C_GM with ZT-Norm, performs well in comparison to other systems in the literature [11, 12]. A key point of our current work is that our computation and implementation are simplified with respect to previous methods. Our new C_GM system with WNAP reduces complexity over older systems with similar performance.

First, we have shown that the SVM training in [12] is not necessary for the speaker comparison task. Second, our system does not require the use of joint factor analysis (JFA) [3]. JFA requires considerably more resources in both corpora and computation. For the speaker subspace in JFA, a large corpus is needed to model inter-speaker variation. For computation, the JFA system requires the solution of both speaker and channel factors; compensation and scoring with JFA can be an order of magnitude slower [4] in our benchmarks. In both the SVM and JFA cases, further research is needed to understand performance in tasks where more (or less) speaker data is available.

7. Conclusions

A new kernel for speaker comparison based upon an approximate KL divergence was presented. We showed that subspace-based channel compensation could be trained and implemented with a simple algorithm, WNAP. An analysis of various configurations of the system demonstrated that simple approximate KL scoring and WNAP produced excellent performance in comparison to SVM and JFA systems. Several methods for achieving state-of-the-art performance were presented.

8. References

[1] W. M. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in ICASSP, 2002, pp. 161-164.

[2] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in ICASSP, 2006, pp. I-97-I-100.

[3] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A study of inter-speaker variability in speaker verification," IEEE Transactions on Audio, Speech and Language Processing, 2008.

[4] W. M. Campbell, Z. Karam, and D. E. Sturim, "Speaker comparison with inner product discriminant functions," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, pp. 207-215.

[5] W. M. Campbell, "Weighted nuisance attribute projection," in submitted to Proc. Odyssey 2010: The Speaker and Language Recognition Workshop, 2010.

[6] D. A. Reynolds, T. F. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.

[7] B. Schölkopf, A. J. Smola, and K.-R. Müller, "Kernel principal component analysis," in Advances in Kernel Methods, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., pp. 327-352, MIT Press, Cambridge, Massachusetts, 1999.

[8] M. A. Przybocki, A. F. Martin, and A. N. Le, "NIST speaker recognition evaluations utilizing the Mixer corpora - 2004, 2005, 2006," IEEE Trans. on Speech, Audio, Lang., vol. 15, no. 7, pp. 1951-1959, 2007.

[9] J. Odell, D. Ollason, P. Woodland, S. Young, and J. Jansen, The HTK Book for HTK V2.0, Cambridge University Press, Cambridge, UK, 1995.

[10] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. of Speaker Odyssey Workshop, 2001, pp. 213-218.

[11] L. Burget, V. Hubeika, O. Glembek, M. Karafiat, M. Kockmann, P. Matejka, P. Schwarz, and J. Cernocky, "BUT system for NIST 2008 speaker recognition evaluation," in Proc. Interspeech, 2009, pp. 2335-2338.
[12] D. Sturim, W. M. Campbell, Z. Karam, D. A. Reynolds, and F. Richardson, "The MIT Lincoln Laboratory 2008 speaker recognition system," in Proc. Interspeech, 2009, pp. 2359-2362.

[13] W. M. Campbell, D. E. Sturim, W. Shen, D. A. Reynolds, and J. Navrátil, "The MIT-LL/IBM 2006 speaker recognition system: High-performance reduced-complexity recognition," in ICASSP, 2007, pp. IV-217-IV-220.