INTERSPEECH 2010

Simple and Efficient Speaker Comparison using Approximate KL Divergence*

W. M. Campbell†, Z. N. Karam†‡
†MIT Lincoln Laboratory, Lexington, MA
‡DSPG, Research Laboratory of Electronics at MIT, Cambridge, MA

Abstract

We describe a simple, novel, and efficient system for speaker comparison with two main components. First, the system uses a new approximate KL divergence distance extending earlier GMM parameter vector SVM kernels. The approximate distance incorporates data-dependent mixture weights as well as the standard MAP-adapted GMM mean parameters. Second, the system applies a weighted nuisance projection method for channel compensation. A simple eigenvector method of training is presented. The resulting speaker comparison system is straightforward to implement and is computationally simple: only two low-rank matrix multiplies and an inner product are needed for comparison of two GMM parameter vectors. We demonstrate the approach on a NIST 2008 speaker recognition evaluation task. We provide insight into what methods, parameters, and features are critical for good performance.

Index Terms: speaker recognition

1. Introduction

Text-independent speaker comparison is the process of taking two speech utterances and providing a match score or posterior probability of match. Speaker comparison can be considered a core building block for speaker recognition systems. Standard approaches to comparison include training and testing with a classifier or building speaker utterance kernels, e.g., [1, 2].

Speaker comparison can be implemented using many different classifiers. We focus on approaches using a GMM universal background model (GMM UBM). Speaker comparison is accomplished using SVM kernel techniques [2]. In this structure, a GMM UBM is adapted per utterance and the resulting models are compared using an approximate KL divergence. This framework is simple and intuitive for speaker recognition since utterances are represented using GMM parameter vectors and speaker comparison is a simple inner product.

Significant improvements in error rates for speaker comparison can be obtained by using data-driven subspace models for channel and speaker representation. Two significant approaches are nuisance attribute projection (NAP) and joint factor analysis (JFA). NAP [2] uses a fixed orthogonal projection to remove nuisance directions from the GMM parameter vector; typically, this nuisance is modeled as session variation. JFA [3] models both the speaker and session variation with subspaces. Factors (coordinates) for the subspaces are derived using a MAP criterion with a prior on the factors.

Combining comparison methods with subspace methods was studied extensively in the inner product discriminant function (IPDF) framework [4].

*This work was sponsored by the Federal Bureau of Investigation under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
The IPDF work considered numerous combinations of classifiers and compensation methods and found two key aspects of good performance. First, classification methods incorporating both speaker-dependent mean and mixture weight parameters gave significant improvement over mean-only systems. Second, subspace channel compensation provided the bulk of system performance improvements.

In this paper, we present an approximate KL divergence kernel combined with weighted NAP (WNAP) [5] that implements the key insights from the IPDF framework. Our strategy is to focus on an easy-to-implement system that is efficient and achieves state-of-the-art performance. An added bonus is that the resulting method is an SVM kernel and can be used in future work for other speaker recognition tasks.

We first cover the top-level speaker comparison framework in Section 2, then present an approximate KL divergence method in Section 3. Section 4 discusses WNAP and the corresponding training criterion. Section 5 presents algorithms for the speaker comparison method. Finally, experiments in Section 6 demonstrate the effectiveness of the method and provide insight into key methods for achieving good performance.

2. GMM Parameter Vectors

A standard distribution used for text-independent speaker recognition is the Gaussian mixture model [6],

    g(x) = \sum_{i=1}^{N} \lambda_i \, \mathcal{N}(x \mid m_i, \Sigma_i).    (1)

Feature vectors are typically cepstral coefficients with associated smoothed first- and second-order derivatives. A sequence of feature vectors, X = (x_1, ..., x_{N_x}), from a speaker is mapped to a GMM by adapting a GMM universal background model (UBM). We assume only the mixture weights, λ_i, and means, m_i, in (1) are adapted. Adaptation of the means is performed with standard relevance MAP [6]; the mixture weights are estimated with the standard ML estimate. The adaptation yields new parameters which we stack into a parameter vector,

    p_x = [\lambda_x^t \;\; m_x^t]^t,    (2)

where

    \lambda_x = [\lambda_{x,1} \cdots \lambda_{x,N}]^t, \qquad m_x = [m_{x,1}^t \cdots m_{x,N}^t]^t.    (3)

Speaker comparison is the process of comparing two sequences of feature vectors, X and Y. Rather than compare these directly, we compare the corresponding parameter vectors, p_x and p_y, obtained from separately adapting the GMM UBM to X and Y. The goal is to provide a comparison function C(p_x, p_y) that produces a value reflecting the similarity of the speakers represented by the two parameter vectors.
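For illustration, the following NumPy sketch (not part of the original system description) maps a sequence of feature vectors to a parameter vector p_x using relevance MAP for the means and the ML estimate for the weights. The function name, the array layout, the diagonal-covariance assumption, and the default relevance factor are ours; the paper reports a relevance factor of 0.01 for its C_GM system.

```python
import numpy as np

def adapt_parameter_vector(frames, ubm_weights, ubm_means, ubm_covars, r=0.01):
    """Map an utterance (frames: T x n) to a GMM parameter vector p_x.

    Means are relevance-MAP adapted from the UBM; mixture weights use the
    standard ML estimate. Covariances are diagonal (N x n). A sketch only,
    not the authors' implementation.
    """
    T, n = frames.shape

    # Frame-level mixture posteriors under the diagonal-covariance UBM.
    diff = frames[:, None, :] - ubm_means[None, :, :]                  # T x N x n
    log_gauss = -0.5 * np.sum(diff**2 / ubm_covars[None], axis=2)
    log_gauss -= 0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(ubm_covars), axis=1))
    log_post = np.log(ubm_weights)[None, :] + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                            # T x N

    # Zeroth- and first-order sufficient statistics.
    n_i = post.sum(axis=0)                                             # occupancy counts
    Ex = post.T @ frames / np.maximum(n_i, 1e-10)[:, None]             # per-mixture means

    # Relevance MAP for means, ML estimate for weights (Section 2).
    alpha = n_i / (n_i + r)
    means = alpha[:, None] * Ex + (1.0 - alpha)[:, None] * ubm_means
    weights = n_i / T

    return np.concatenate([weights, means.ravel()])                    # p_x = [lambda^t m^t]^t
```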
3. Approximate KL Divergence

An obvious strategy for comparing the GMM parameter vectors is to use the KL divergence between the distributions,

    D(g_x \| g_y) = \int_{\mathbb{R}^n} g_x(x) \log \frac{g_x(x)}{g_y(x)} \, dx.    (4)

Using the KL divergence directly is difficult because it cannot be computed in closed form. Therefore, an approximate KL divergence has been used successfully in speaker recognition [2]. An approximation based on the log-sum inequality is applied to (4) to split out individual mixtures and obtain

    D(g_x \| g_y) \le D(\lambda_x \| \lambda_y) + \sum_{i=1}^{N} \lambda_{x,i} \, D\big(\mathcal{N}(\cdot \mid m_{x,i}, \Sigma_i) \,\|\, \mathcal{N}(\cdot \mid m_{y,i}, \Sigma_i)\big),    (5)

where the \Sigma_i are from the UBM. Note that we drop the term D(λ_x || λ_y) in what follows, since we are finding an upper bound and the KL divergence is always greater than zero. By symmetrizing (5) and substituting in the KL divergence between two Gaussian distributions, we obtain a distance which upper bounds the symmetric KL divergence,

    d_s(p_x, p_y) = \sum_{i=1}^{N} (0.5\lambda_{x,i} + 0.5\lambda_{y,i}) (m_{x,i} - m_{y,i})^t \Sigma_i^{-1} (m_{x,i} - m_{y,i}).    (6)

A corresponding inner product to this distance is

    C_{KL}(p_x, p_y) = \sum_{i=1}^{N} (0.5\lambda_{x,i} + 0.5\lambda_{y,i}) \, m_{x,i}^t \Sigma_i^{-1} m_{y,i}.    (7)

Note that (7) can also be expressed more compactly as

    C_{KL}(p_x, p_y) = m_x^t \big((0.5\lambda_x + 0.5\lambda_y) \otimes I_n\big) \Sigma^{-1} m_y,    (8)

where Σ is the block diagonal matrix with the Σ_i from the UBM on the diagonal, n is the feature vector dimension, and ⊗ is the Kronecker product. Note that shifting the means by the UBM will not affect the distance in (6), so we can replace the means in (8) by the UBM-centered means.

The comparison function C_KL does not correspond to an inner product in the Mercer sense; that is, we cannot separate C_KL into an inner product of the form b(p_x)^t b(p_y), where b(·) is some mapping function. A simple solution to this problem is to replace the arithmetic mean between the mixture weights in (8) with a geometric mean; we obtain

    C_{GM}(p_x, p_y) = m_x^t (\lambda_x^{1/2} \otimes I_n) \Sigma^{-1} (\lambda_y^{1/2} \otimes I_n) m_y,    (9)

where Σ is the block diagonal matrix of the UBM covariances. In experiments, we have found (9) to be a good approximation of (8). We mention that the corresponding SVM expansion to the kernel (9) is

    b(p_x) = (\lambda_x^{1/2} \otimes I_n) \Sigma^{-1/2} m_x.    (10)
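As an illustration, a short NumPy sketch of (9) and (10) follows. It assumes diagonal UBM covariances and the parameter-vector layout used in the earlier sketch; the identifiers are ours, not the paper's. The kernel value is obtained by expanding each utterance with b(·) and taking an inner product.

```python
import numpy as np

def gm_expansion(weights, means, ubm_covars):
    """SVM expansion b(p) of (10) for one utterance.

    weights: (N,) adapted mixture weights; means: (N, n) adapted means
    (optionally UBM-centered); ubm_covars: (N, n) diagonal UBM covariances.
    Returns a vector of length N*n. Illustrative sketch only.
    """
    # (lambda^{1/2} kron I_n) Sigma^{-1/2} m, computed per mixture.
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(ubm_covars)
    return scaled.ravel()

def c_gm(px, py, ubm_covars, n_mix, feat_dim):
    """Approximate KL comparison C_GM(p_x, p_y) of (9) as an inner product."""
    def split(p):
        return p[:n_mix], p[n_mix:].reshape(n_mix, feat_dim)
    wx, mx = split(px)
    wy, my = split(py)
    return gm_expansion(wx, mx, ubm_covars) @ gm_expansion(wy, my, ubm_covars)
```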
4. WNAP

Before defining WNAP, we introduce some notation. We define an orthogonal projection with respect to a metric, P_{U,D}, where D and U are full rank matrices, as

    P_{U,D} = U (U^t D^2 U)^{-1} U^t D^2,    (11)

where the columns of U form a linearly independent set, and the metric is

    \| x - y \|_D = \| Dx - Dy \|_2.    (12)

The process of projection, e.g., y = P_{U,D} b, is equivalent to solving the least-squares problem

    \hat{x} = \arg\min_x \| DUx - Db \|_2    (13)

and setting y = U\hat{x}. In practice, the projection reduces to matrix multiplies by an orthonormalized version of the subspace. The complementary projection, I - P_{U,D}, projects onto the subspace orthogonal to U with respect to the metric; its use is to reduce nuisances present in the expansion proposed in Section 2. The main assumption is that the nuisance is confined to a "small"-dimensional subspace of the expansion space. For WNAP, we use a general form of this projection.

For the WNAP training set, we assume that for every speaker (in general, every class) we can estimate a "low noise" vector z̄ from which deltas can be calculated. In practice, this smoothed vector is formed by adapting a model from the data pooled across multiple utterances from the same speaker. We then base our criterion on approximating these deltas.

More specifically, suppose we have a training set, {z_{s,i}}, labeled by speaker, s, and instance, i. For each s, we have a smoothed vector, z̄_s. For WNAP training, we use the following optimization problem,

    \min_U \sum_s \sum_i W_{s,i} \, \big\| P_{U,D_{s,i}} d_{s,i} - d_{s,i} \big\|_{D_{s,i}}^2,    (14)

where d_{s,i} = z_{s,i} - z̄_s. The WNAP training criterion (14) incorporates the goals of using a variable metric and an utterance-dependent weighting, W_{s,i}; see [5]. The training criterion attempts to find a subspace U that best approximates the nuisance d, as in prior work [2]. For the purposes of this work, we assume that D_{s,i} = D is a constant. Prior work has shown that this is a good compromise between performance and computational efficiency [5].

In the case of constant D, the WNAP criterion can be shown to be equivalent to the following problem. First, we incorporate the W_i into the d_i by letting

    \hat{d}_i = \sqrt{W_i} \, D d_i,    (15)

and we work with the transformed subspace \hat{U} = DU, constrained so that \hat{U}^t \hat{U} = I. Second, we form

    \hat{R} = \sum_{i=1}^{N} \hat{d}_i \hat{d}_i^t.    (16)

Then, the criterion (14) can be expressed as

    \max_{\hat{U}} \; \mathrm{tr}\big[\hat{U}^t \hat{R} \hat{U}\big].    (17)

In the equation, U is the desired nuisance subspace, \hat{U} = DU, and \hat{R} = DRD, where R = \sum_i W_i d_i d_i^t. This problem can be solved using an eigenvector method that will be presented in Section 5. Intuitively, the problem (17) finds a low-rank approximation, U, that best approximates the nuisance subspace.
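Before turning to the algorithms, the following minimal NumPy sketch (our own illustration, assuming a diagonal metric D for simplicity) forms the projection P_{U,D} of (11) and checks its least-squares characterization (13). Compensation then amounts to subtracting the projected nuisance component, q = b - P b, which is how the projection is used in Algorithm 2.

```python
import numpy as np

def metric_projection(U, d):
    """P_{U,D} of (11) for a diagonal metric D = diag(d).

    U: (M, k) full column rank nuisance basis; d: (M,) positive diagonal of D.
    Returns the M x M projection matrix (for illustration; in practice it is
    applied as two low-rank matrix multiplies rather than formed explicitly).
    """
    D2U = (d**2)[:, None] * U                      # D^2 U
    G = U.T @ D2U                                  # U^t D^2 U
    return U @ np.linalg.solve(G, D2U.T)           # U (U^t D^2 U)^{-1} U^t D^2

# Consistency check with (13): P_{U,D} b equals U x_hat, where x_hat solves
# min_x || D U x - D b ||_2.
rng = np.random.default_rng(0)
M, k = 40, 5
U = rng.standard_normal((M, k))
d = rng.uniform(0.5, 2.0, size=M)
b = rng.standard_normal(M)

P = metric_projection(U, d)
x_hat, *_ = np.linalg.lstsq(d[:, None] * U, d * b, rcond=None)
assert np.allclose(P @ b, U @ x_hat)
```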
5. Algorithms

Our method for speaker comparison can be split into two components: training the nuisance subspace and performing speaker comparison scoring. Both algorithms are straightforward to implement with matrix tools such as Matlab.

Training the nuisance subspace is shown in Algorithm 1. For the training set, we first note that only the utterance MAP-adapted means for the vectors m_i are used. A typical data set for training the nuisance subspace would have several sessions per speaker, typically 8 or more. A second comment on Algorithm 1 is that the metric, D, used for training the subspace is not utterance dependent; in practice, this has not impacted performance. Third, we mention that one good choice for W_i is the number of speech frames detected by speech activity detection. Fourth, we mention that in the algorithm, kernel PCA can be used as an alternative to the direct parameter expansion [7].

Algorithm 1  WNAP subspace training algorithm for a fixed metric, D = (\lambda_{UBM}^{1/2} \otimes I_n) \Sigma^{-1/2}
Input: Mean parameter vectors {m_i}, weights {W_i} with speaker labels {l_i}, and the desired corank
Output: Nuisance subspace, U
  for all unique speakers s in {l_i} do
      Find m̄_s, the mean of the m_j for all j in {j | l_j = s}
      for all j in {j | l_j = s} do
          Let d_j = m_j - m̄_s
      end for
  end for
  R = 0
  for i = 1 to N do
      R = R + W_i d_i d_i^t
  end for
  R̂ = D R D
  Û = eigs(R̂, corank)   % eigs produces the eigenvectors of the largest magnitude eigenvalues
  U = D^{-1} Û

In Algorithm 2, we show compensation using WNAP and speaker comparison scoring using C_GM. Note that the matrix D used for compensation is the same as in Algorithm 1. We also mention that the relevance factor for MAP adaptation can be tuned; typically, we use a relevance factor of 0.01.

Algorithm 2  Compensation and scoring with the C_GM kernel
Input: Two sequences of feature vectors, X_1 and X_2
Output: Comparison score, s
  for i = 1 to 2 do
      m_i = mean parameters of the MAP-adapted UBM for X_i
      D_i = (\lambda_i^{1/2} \otimes I_n) \Sigma^{-1/2}
      m_i ← m_i - P_{U,D} m_i   % remove the nuisance component
  end for
  s = m_1^t D_1 D_2 m_2
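A compact NumPy sketch of Algorithms 1 and 2 follows, exploiting the diagonal structure of D so the compensation reduces to two low-rank matrix multiplies. The identifiers and array layouts are ours, and the sketch forms R̂ explicitly, which is only practical for modest dimensions; for realistic Nn one would use an incremental or kernel PCA style solver, as noted above.

```python
import numpy as np

def train_wnap_subspace(M, W, labels, d_diag, corank):
    """Algorithm 1 (sketch): WNAP nuisance subspace for a fixed diagonal metric.

    M: (num_utts, Nn) stacked mean parameter vectors; W: (num_utts,) weights,
    e.g. speech-frame counts; labels: speaker label per utterance;
    d_diag: (Nn,) diagonal of D = (lambda_UBM^{1/2} kron I_n) Sigma^{-1/2}.
    Returns U with `corank` columns.
    """
    labels = np.asarray(labels)
    # Per-speaker deltas d_j = m_j - mean of that speaker's utterances.
    deltas = np.empty_like(M)
    for s in np.unique(labels):
        idx = np.flatnonzero(labels == s)
        deltas[idx] = M[idx] - M[idx].mean(axis=0)

    # R = sum_i W_i d_i d_i^t, then R_hat = D R D.
    R = (deltas * W[:, None]).T @ deltas
    R_hat = d_diag[:, None] * R * d_diag[None, :]

    # Eigenvectors of the largest eigenvalues, then map back: U = D^{-1} U_hat.
    eigvals, eigvecs = np.linalg.eigh(R_hat)
    U_hat = eigvecs[:, np.argsort(eigvals)[::-1][:corank]]
    return U_hat / d_diag[:, None]

def compare(m1, lam1, m2, lam2, U, d_diag, ubm_covars_diag):
    """Algorithm 2 (sketch): WNAP compensation and C_GM scoring."""
    def compensate(m):
        # m <- m - P_{U,D} m; with U from training, U^t D^2 U = I,
        # so P_{U,D} m = U (U^t D^2 m).
        return m - U @ (U.T @ (d_diag**2 * m))
    def metric(lam, n):
        # Diagonal of D_i = (lambda_i^{1/2} kron I_n) Sigma^{-1/2}.
        return np.repeat(np.sqrt(lam), n) / np.sqrt(ubm_covars_diag)
    n = m1.size // lam1.size
    m1c, m2c = compensate(m1), compensate(m2)
    return m1c @ (metric(lam1, n) * metric(lam2, n) * m2c)   # s = m1^t D1 D2 m2
```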
6. Experiments

6.1. Setup

Experiments were performed on the NIST 2008 speaker recognition evaluation (SRE) data set. Enrollment/verification methodology and the evaluation criteria, equal error rate (EER) and minDCF, were based on the NIST SRE evaluation plan [8]. The main focus of our effort was the one conversation enroll, one conversation verification task for telephone channel speech. T-Norm models and Z-Norm speech utterances were drawn from the NIST 2004 SRE corpus. Results were obtained both for English-only trials (Eng, pool 7) and for all trials (All, pool 6), which include speakers that enroll/verify in different languages.

Feature extraction was performed using HTK [9] with 20 MFCC coefficients, deltas, and acceleration coefficients for a total of 60 features. Speech activity detection (SAD) was performed using a cascade of two systems. First, a GMM speech/non-speech detector was applied. Then, these SAD marks were post-processed with an energy-based detector. Features from non-speech frames were eliminated, and then feature warping [10] was applied to all of the resulting features with a 3 second window.

A GMM UBM with 512 mixture components was trained using data from NIST SRE 2004 and from Switchboard corpora. A nuisance subspace was trained using the speakers from Switchboard 2 and NIST 2004 SRE corpora using Algorithm 1. The dimension of the nuisance subspace, U, was fixed at 64.

A few aspects of the front end were critical for the best performance. First, the full bandwidth MFCC analysis, 0-4 kHz, performed the best; in our experiments, we found that WNAP could take advantage of the additional bandwidth for speaker comparison. Second, our cascaded SAD is fairly aggressive; we found that low-level speech was not helpful in discrimination and could contain cross-talk. Finally, feature warping was a slight gain over feature 0-1 mean and variance normalization.

For our SVM system, we used both the mean-only KL kernel, K_KL, described in [2] and the new C_GM kernel from (9). The SVM background was constructed from Fisher data. T-Norm and Z-Norm were performed in the same manner as for the C_GM system. A relevance factor of 4 was used for the K_KL kernel to match prior work and for best performance. For the C_GM kernel, a relevance factor of 0.01 was used in both the kernel-only and SVM experiments.

6.2. Results

The first two lines of Table 1 show baseline systems and their compute time from benchmarks. Computation is not shown for fusion systems (MIT LL and MFCC+LPCC) because we are focusing on single system performance. The next three lines of the table contrast the new C_GM kernel with the K_KL kernel from earlier work. In the table, we see that the new C_GM kernel (shown with ZT-Norm) outperforms the K_KL kernel in all tasks. Note that C_GM incorporates the utterance mixture weights as a way of discounting uncertain mixture components in the inner product; thus, C_GM allows a lower relevance factor to be used. The prior kernel, K_KL, is more sensitive to a "noisy" model with uncertain mixture components and requires a higher relevance factor.

Table 1: A comparison of different systems on the NIST SRE 2008, one conversation telephone train and test subset (pools 6 and 7). Compute time is normalized to a JFA baseline and includes compensation and inner product only. Best performing systems are shown in bold for reference.

    System                        EER All (%)   minDCF All (x100)   EER Eng (%)   minDCF Eng (x100)   Compute time
    BUT MFCC 20 System [11]       5.71          2.95                2.85          1.40                1.00
    MIT LL Fused System [12]      7.00          3.60                3.30          1.60                -
    K_KL, rf = 0.01               7.22          3.39                4.40          2.04                0.08
    K_KL, rf = 4                  6.11          3.04                3.34          1.68                0.08
    C_GM                          5.86          2.89                3.09          1.57                0.08
    SVM K_KL                      6.51          3.01                3.51          1.65                0.86
    SVM C_GM                      6.31          2.95                3.64          1.57                0.86
    C_GM w/o ZT-Norm              6.83          3.58                3.75          1.98                0.08
    C_GM Z-Norm only              6.39          3.12                3.33          1.64                0.08
    C_GM T-Norm only              6.72          3.31                3.63          1.70                0.08
    C_GM w/o Calibration          6.70          3.68                3.09          1.57                0.08
    C_GM LPCC                     6.18          2.85                2.99          1.59                0.08
    Fuse LPCC+MFCC                5.27          2.58                2.91          1.36                -

The next set of experiments uses the kernels in an SVM configuration. The results demonstrate that SVM training does not improve the performance of the speaker comparison system. This result might be expected, since having only one enrollment utterance for the target speaker gives SVM training little additional information.

We also performed simple calibration and linear fusion experiments with our system. First, we experimented with different score normalizations: Z-Norm, T-Norm, etc. Second, we demonstrated the effect of not using English/non-English calibration (w/o Calibration) [11]. Third, we fused the MFCC system score linearly with an LPCC-based system. The LPCC front end was a minor change of the HTK configuration: 18 cepstral coefficients with energy were used along with deltas and acceleration for a total of 57 features. The resulting system also performed well and fused with our base MFCC system to achieve substantially better performance. Fusion of multiple feature types with the same system has improved performance in many systems [11, 13].
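Several rows of Table 1 isolate the effect of score normalization. For reference, the following generic sketch (standard Z-Norm/T-Norm, not code from the paper) shows the computations: Z-Norm standardizes a score using impostor utterances scored against the enrolled model, T-Norm uses a cohort of impostor models scored against the test utterance, and ZT-Norm composes the two.

```python
import numpy as np

def z_norm(raw_score, impostor_utt_scores):
    """Z-Norm: normalize with statistics of the enrolled model scored
    against a set of impostor utterances (generic sketch)."""
    mu, sigma = np.mean(impostor_utt_scores), np.std(impostor_utt_scores)
    return (raw_score - mu) / sigma

def t_norm(raw_score, cohort_model_scores):
    """T-Norm: normalize with statistics of a cohort of impostor models
    scored against the same test utterance (generic sketch)."""
    mu, sigma = np.mean(cohort_model_scores), np.std(cohort_model_scores)
    return (raw_score - mu) / sigma

def zt_norm(raw_score, impostor_utt_scores, znormed_cohort_scores):
    """ZT-Norm: Z-Norm the raw score, then T-Norm it against cohort scores
    that have themselves been Z-normalized (each with its own cohort model's
    impostor statistics). Conventions vary across systems."""
    return t_norm(z_norm(raw_score, impostor_utt_scores),
                  np.asarray(znormed_cohort_scores))
```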
6.3. Analysis

We note that our best performing single system, C_GM with ZT-Norm, performs well in comparison to other systems in the literature [11, 12]. A key point of our current work is that our computation and implementation are simplified with respect to previous methods. Our new C_GM system with WNAP reduces complexity over older systems with similar performance.

First, we have shown that the SVM training in [12] is not necessary for the speaker comparison task. Second, our system does not require the use of joint factor analysis (JFA) [3]. JFA requires considerably more resources in both corpora and computation. For the speaker subspace in JFA, a large corpus is needed to model inter-speaker variation. For computation, the JFA system requires the solution of both speaker and channel factors; compensation and scoring with JFA can be an order of magnitude slower [4] in our benchmarks. In both the SVM and JFA cases, further research is needed to understand performance in tasks where more (or less) speaker data is available.

7. Conclusions

A new kernel for speaker comparison based upon an approximate KL divergence was presented. We showed that subspace-based channel compensation could be trained and implemented with a simple algorithm, WNAP. An analysis of various configurations of the system demonstrated that simple approximate KL scoring and WNAP produced excellent performance in comparison to SVM and JFA systems. Several methods for achieving state-of-the-art performance were presented.

8. References

[1] W. M. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in ICASSP, 2002, pp. 161-164.

[2] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in ICASSP, 2006, pp. I-97-I-100.

[3] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A study of inter-speaker variability in speaker verification," IEEE Transactions on Audio, Speech and Language Processing, 2008.

[4] W. M. Campbell, Z. Karam, and D. E. Sturim, "Speaker comparison with inner product discriminant functions," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, pp. 207-215.

[5] W. M. Campbell, "Weighted nuisance attribute projection," in submitted to Proc. Odyssey 2010: The Speaker and Language Recognition Workshop, 2010.

[6] D. A. Reynolds, T. F. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.

[7] B. Schölkopf, A. J. Smola, and K.-R. Müller, "Kernel principal component analysis," in Advances in Kernel Methods, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., pp. 327-352, MIT Press, Cambridge, Massachusetts, 1999.

[8] M. A. Przybocki, A. F. Martin, and A. N. Le, "NIST speaker recognition evaluations utilizing the Mixer corpora - 2004, 2005, 2006," IEEE Trans. on Speech, Audio, Lang., vol. 15, no. 7, pp. 1951-1959, 2007.

[9] J. Odell, D. Ollason, P. Woodland, S. Young, and J. Jansen, The HTK Book for HTK V2.0, Cambridge University Press, Cambridge, UK, 1995.

[10] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. of Speaker Odyssey Workshop, 2001, pp. 213-218.

[11] L. Burget, V. Hubeika, O. Glembek, M. Karafiat, M. Kockmann, P. Matejka, P. Schwarz, and J. Cernocky, "BUT system for NIST 2008 speaker recognition evaluation," in Proc. Interspeech, 2009, pp. 2335-2338.
[12] D. Sturim, W. M. Campbell, Z. Karam, D. A. Reynolds, and F. Richardson, "The MIT Lincoln Laboratory 2008 speaker recognition system," in Proc. Interspeech, 2009, pp. 2359-2362.

[13] W. M. Campbell, D. E. Sturim, W. Shen, D. A. Reynolds, and J. Navrátil, "The MIT-LL/IBM 2006 speaker recognition system: High-performance reduced-complexity recognition," in ICASSP, 2007, pp. IV-217-IV-220.