Simple and Efficient Speaker Comparison using Approximate KL Divergence

INTERSPEECH 2010
W. M. Campbell†, Z. N. Karam†‡
†MIT Lincoln Laboratory, Lexington, MA
‡DSPG, Research Laboratory of Electronics at MIT, Cambridge, MA

*This work was sponsored by the Federal Bureau of Investigation under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
Abstract
We describe a simple, novel, and efficient system for speaker comparison with two main components. First, the system uses a new approximate KL divergence distance extending earlier GMM parameter vector SVM kernels. The approximate distance incorporates data-dependent mixture weights as well as the standard MAP-adapted GMM mean parameters. Second, the system applies a weighted nuisance projection method for channel compensation. A simple eigenvector method of training is presented. The resulting speaker comparison system is straightforward to implement and is computationally simple: only two low-rank matrix multiplies and an inner product are needed for comparison of two GMM parameter vectors. We demonstrate the approach on a NIST 2008 speaker recognition evaluation task. We provide insight into what methods, parameters, and features are critical for good performance.
Index Terms: speaker recognition
1. Introduction
Text-independent speaker comparison is the process of taking two speech utterances and providing a match score or posterior probability of match. Speaker comparison can be considered a core building block for constructing speaker recognition systems. Standard approaches to comparison include training and testing using a classifier or building speaker utterance kernels, e.g. [1, 2].

Speaker comparison can be implemented using many different classifiers. We focus on approaches using a GMM universal background model (GMM UBM). Speaker comparison is accomplished using SVM kernel techniques [2]. In this structure, a GMM UBM is adapted per utterance and the resulting models are compared using an approximate KL divergence. This framework is simple and intuitive for speaker recognition since utterances are represented using GMM parameter vectors and speaker comparison is a simple inner product.
Significant improvements in error rates for speaker comparison can be obtained by using data-driven subspace models for channel and speaker representation. Two significant approaches are nuisance attribute projection (NAP) and joint factor analysis (JFA). NAP [2] uses a fixed orthogonal projection to remove nuisance directions from the GMM parameter vector. Typically, this nuisance is modeled as session variation. JFA [3] models both the speaker and session variation with subspaces. Factors (coordinates) for the subspaces are derived using a MAP criterion with a prior on the factors.
Combining comparison methods with subspace methods was studied extensively in the inner product discriminant function (IPDF) framework [4]. IPDFs considered numerous combinations of classifiers and compensation methods and found two key aspects of good performance. First, classification methods incorporating both speaker-dependent mean and mixture weight parameters gave significant improvement over mean-only systems. Second, subspace channel compensation provided the bulk of system performance improvements.
In this paper, we present an approximate KL divergence kernel combined with weighted NAP (WNAP) [5] that implements the key insights from the IPDF framework. Our strategy is to focus on an easy-to-implement system that is efficient and achieves state-of-the-art performance. An added bonus is that our resulting method is an SVM kernel and can be used in future work for other speaker recognition tasks.

In this paper, we first cover the top-level speaker comparison framework in Section 2, then we present an approximate KL-divergence method in Section 3. Section 4 discusses WNAP and the corresponding training criterion. Section 5 presents algorithms for the speaker comparison method. Finally, experiments in Section 6 demonstrate the effectiveness of the method and provide insight into key methods for achieving good performance.
2. GMM Parameter Vectors
A standard distribution used for text-independent speaker recognition is the Gaussian mixture model [6],

    g(x) = \sum_{i=1}^{N} \lambda_i \, N(x \mid m_i, \Sigma_i).    (1)

Feature vectors are typically cepstral coefficients with associated smoothed first- and second-order derivatives. A sequence of feature vectors, X = (x_1, ..., x_{N_x}), from a speaker is mapped to a GMM by adapting a GMM universal background model (UBM). We will assume only the mixture weights, λ_i, and means, m_i, in (1) are adapted. Adaptation of the means is performed with standard relevance MAP [6]. We estimate the mixture weights using the standard ML estimate. The adaptation yields new parameters which we stack into a parameter vector, p_x,

    p_x = \begin{bmatrix} \lambda_x^t & m_{x,1}^t & \cdots & m_{x,N}^t \end{bmatrix}^t    (2)

where

    \lambda_x = \begin{bmatrix} \lambda_{x,1} & \cdots & \lambda_{x,N} \end{bmatrix}^t.    (3)

Speaker comparison is the process of comparing two sequences of feature vectors, X and Y. Rather than compare these directly, we compare the corresponding parameter vectors, p_x and p_y, obtained from separately adapting the GMM UBM to X and Y. The goal is to provide a comparison function C(p_x, p_y) that produces a value reflecting the similarity of the speakers represented by the two parameter vectors.
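For concreteness, the following Python sketch builds the parameter vector of (2)-(3) from zeroth- and first-order sufficient statistics. The statistic inputs, the helper name, and the default relevance factor are illustrative assumptions rather than the authors' implementation; the default mirrors the relevance factor used for the C_GM kernel in Section 6.

```python
import numpy as np

def gmm_parameter_vector(ubm_means, occupancies, first_order_sums, relevance=0.01):
    """Stack ML mixture weights and relevance-MAP-adapted means into p_x.

    ubm_means:        (N, n) UBM component means
    occupancies:      (N,)   soft counts n_i from the UBM posteriors
    first_order_sums: (N, n) posterior-weighted feature sums
    relevance:        relevance factor r for MAP adaptation of the means
    """
    lam = occupancies / occupancies.sum()               # ML mixture-weight estimate
    alpha = occupancies / (occupancies + relevance)     # MAP adaptation coefficients
    ml_means = first_order_sums / np.maximum(occupancies, 1e-10)[:, None]
    map_means = alpha[:, None] * ml_means + (1.0 - alpha[:, None]) * ubm_means
    # p_x = [lambda_x^t, m_{x,1}^t, ..., m_{x,N}^t]^t as in (2)-(3)
    return np.concatenate([lam, map_means.ravel()])
```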
3. Approximate KL Divergence

An obvious strategy for comparing the GMM parameter vectors is to use the KL divergence between the distributions,

    D(g_x \| g_y) = \int_{\mathbb{R}^n} g_x(x) \log \frac{g_x(x)}{g_y(x)} \, dx.    (4)

Using the KL divergence directly is difficult because it cannot be computed in closed form. Therefore, an approximate KL divergence has been used successfully in speaker recognition [2]. An approximation based on the log-sum inequality is applied to (4) to split out individual mixtures to obtain

    D(g_x \| g_y) \le \sum_{i=1}^{N} \lambda_{x,i} \, D\left( N(\cdot \mid m_{x,i}, \Sigma_i) \,\|\, N(\cdot \mid m_{y,i}, \Sigma_i) \right).    (5)

Here, Σ_i is from the UBM. Note that we have dropped the term D(λ_x || λ_y), since we are finding an upper bound and the KL divergence is always greater than zero.

By symmetrizing (5) and substituting in the KL divergence between two Gaussian distributions, we obtain a distance which upper bounds the symmetric KL divergence, d_s(p_x, p_y),

    d_s(p_x, p_y) = \sum_{i=1}^{N} (0.5\lambda_{x,i} + 0.5\lambda_{y,i}) (m_{x,i} - m_{y,i})^t \Sigma_i^{-1} (m_{x,i} - m_{y,i}).    (6)

A corresponding inner product to this distance is

    C_{KL}(p_x, p_y) = \sum_{i=1}^{N} (0.5\lambda_{x,i} + 0.5\lambda_{y,i}) \, m_{x,i}^t \Sigma_i^{-1} m_{y,i}.    (7)

Note that (7) can also be expressed more compactly as

    C_{KL}(p_x, p_y) = m_x^t \left( (0.5\lambda_x + 0.5\lambda_y) \otimes I_n \right) \Sigma^{-1} m_y    (8)

where Σ is the block matrix with the Σ_i from the UBM on the diagonal, n is the feature vector dimension, and ⊗ is the Kronecker product. Note that shifting the means by the UBM will not affect the distance in (6), so we can replace the means in (8) by the UBM-centered means.

The comparison function, C_KL, does not correspond to an inner product in the Mercer sense. That is, we cannot separate C_KL into an inner product of the form b(p_x)^t b(p_y) where b(·) is some mapping function. A simple solution to this problem is to replace the arithmetic mean between mixture weights in (8) with a geometric mean; we obtain

    C_{GM}(p_x, p_y) = m_x^t \left( \lambda_x^{1/2} \otimes I_n \right) \Sigma^{-1} \left( \lambda_y^{1/2} \otimes I_n \right) m_y    (9)

where Σ is the block diagonal of the UBM covariances. In experiments, we have found (9) to be a good approximation of (8). We mention that the corresponding SVM expansion to the kernel (9) is

    b(p_x) = \left( \lambda_x^{1/2} \otimes I_n \right) \Sigma^{-1/2} m_x.    (10)
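As a concrete illustration, the numpy sketch below evaluates the expansion b(p) of (10) and the kernels (7)-(9). It assumes diagonal UBM covariances (so Σ^{-1/2} is stored elementwise) and uses UBM-centered means, which is permitted since centering does not change the distance (6); the function names and data layout are illustrative assumptions.

```python
import numpy as np

def split_params(p, N, n):
    """Split p = [lambda (N), stacked means (N*n)] into weights and (N, n) means."""
    return p[:N], p[N:].reshape(N, n)

def b_expansion(p, ubm_means, inv_sqrt_cov, N, n):
    """SVM expansion of (10): b(p) = (lambda^{1/2} (x) I_n) Sigma^{-1/2} m."""
    lam, means = split_params(p, N, n)
    centered = means - ubm_means                      # UBM-centered means
    return (np.sqrt(lam)[:, None] * centered * inv_sqrt_cov).ravel()

def c_gm(p_x, p_y, ubm_means, inv_sqrt_cov, N, n):
    """Geometric-mean kernel (9) as an inner product of two expansions."""
    bx = b_expansion(p_x, ubm_means, inv_sqrt_cov, N, n)
    by = b_expansion(p_y, ubm_means, inv_sqrt_cov, N, n)
    return float(bx.dot(by))

def c_kl(p_x, p_y, ubm_means, inv_sqrt_cov, N, n):
    """Arithmetic-mean comparison function (7)-(8), for reference."""
    lx, mx = split_params(p_x, N, n)
    ly, my = split_params(p_y, N, n)
    w = 0.5 * lx + 0.5 * ly
    cx = (mx - ubm_means) * inv_sqrt_cov
    cy = (my - ubm_means) * inv_sqrt_cov
    return float(np.sum(w * np.sum(cx * cy, axis=1)))
```

Because b(·) factorizes per utterance, C_GM can be used directly as an SVM kernel, whereas C_KL cannot.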
4. WNAP

Before defining WNAP, we introduce some notation. We define an orthogonal projection with respect to a metric, P_{U,D}, where D and U are full rank matrices, as

    P_{U,D} = U (U^t D^2 U)^{-1} U^t D^2    (11)

where DU is a linearly independent set, and the metric is

    \| x - y \|_D = \| Dx - Dy \|_2.    (12)

The process of projection, e.g. y = P_{U,D} b, is equivalent to solving the least-squares problem

    \hat{x} = \arg\min_x \| Ux - b \|_D.    (13)

In practice, the projection is implemented as a matrix multiply by an orthonormal basis of the subspace with respect to the metric. For WNAP, we use a general projection of the form P_{U,D}. The use of this projection is to reduce nuisances present in the expansion proposed in Section 2. The main assumption is that the nuisance is confined to a "small" dimensional subspace of the expansion space.
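A small numpy check of the projection (11) and its least-squares characterization (13); the diagonal metric, dimensions, and random data below are arbitrary and serve only to verify the algebra.

```python
import numpy as np

def project(U, D, b):
    """P_{U,D} b = U (U^t D^2 U)^{-1} U^t D^2 b, with D given as a vector of
    diagonal metric weights (a simplifying assumption)."""
    D2 = D ** 2
    G = U.T @ (D2[:, None] * U)                   # U^t D^2 U
    return U @ np.linalg.solve(G, U.T @ (D2 * b))

rng = np.random.default_rng(0)
dim, rank = 50, 5
U = rng.standard_normal((dim, rank))
D = rng.uniform(0.5, 2.0, dim)
b = rng.standard_normal(dim)

y = project(U, D, b)
# Least-squares form (13): x_hat = argmin_x ||Ux - b||_D = argmin_x ||D U x - D b||_2
x_hat, *_ = np.linalg.lstsq(D[:, None] * U, D * b, rcond=None)
assert np.allclose(y, U @ x_hat)
```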
For the WNAP training set, we assume that for every speaker (in general, every class) we can estimate a "low noise" vector from which deltas can be calculated. In practice, this smoothed vector is formed by adapting a model from the data pooled across multiple utterances from the same speaker. We then base our criterion on approximating these deltas.

More specifically, suppose we have a training set, {z_{s,i}}, labeled by speaker, s, and instance, i. For each s, we have a smoothed vector, z̄_s. For WNAP training, we use the following optimization problem,

    \min_U \sum_s \sum_i W_{s,i} \left\| P_{U, D_{s,i}} d_{s,i} - d_{s,i} \right\|^2_{D_{s,i}}    (14)

where d_{s,i} = z_{s,i} - z̄_s. The WNAP training criterion (14) incorporates the goals of using a variable metric and an utterance-dependent weighting, W_{s,i}; see [5]. The training criterion attempts to find a subspace U that best approximates the nuisance d_{s,i} as in prior work [2]. For the purposes of this work, we assume that D_{s,i} = D is a constant. Prior work has shown that this is a good compromise in performance and computational efficiency [5].

In the case of constant D, the WNAP criterion can be shown to be equivalent to the following problem. First, we incorporate the W_i into the d_i by letting \hat{d}_i = \sqrt{W_i} \, D d_i and let \hat{U} = DU with

    \hat{U}^t \hat{U} = I.    (15)

Second, we find the correlation matrix

    \hat{R} = \sum_{i=1}^{N} \hat{d}_i \hat{d}_i^t.    (16)

Then, the criterion (14) can be expressed as

    \max_{\hat{U}} \mathrm{tr}\left[ \hat{U}^t \hat{R} \hat{U} \right].    (17)

In the equation, U is the desired nuisance subspace, Û = DU, and R̂ = DRD. This problem can be solved using an eigenvector method that will be presented in Section 5. Intuitively, the problem (17) finds a low rank approximation, U, that best approximates the nuisance subspace.
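The reduction of (14) to the trace problem (17), and its solution by the top eigenvectors of R̂, can be sanity-checked numerically; the synthetic deltas, weights, and metric below are arbitrary assumptions used only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_utt, corank = 40, 200, 5
D = rng.uniform(0.5, 2.0, dim)                 # fixed diagonal metric (assumption)
d = rng.standard_normal((n_utt, dim))          # synthetic nuisance deltas d_i
W = rng.uniform(50, 500, n_utt)                # utterance weights W_i

d_hat = np.sqrt(W)[:, None] * d * D            # d_hat_i = sqrt(W_i) D d_i
R_hat = d_hat.T @ d_hat                        # R_hat = sum_i d_hat_i d_hat_i^t  (16)

evals, evecs = np.linalg.eigh(R_hat)           # eigenvalues in ascending order
U_hat = evecs[:, -corank:]                     # top-corank eigenvectors
Q, _ = np.linalg.qr(rng.standard_normal((dim, corank)))   # random orthonormal basis

def objective(B):                              # criterion (17)
    return np.trace(B.T @ R_hat @ B)

assert objective(U_hat) >= objective(Q)        # eigenvector solution is no worse
```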
5. Algorithms

Our method for speaker comparison can be split into two components: training the nuisance subspace and performing speaker comparison scoring. Both algorithms are straightforward to implement with matrix tools such as Matlab.

Training the nuisance subspace is shown in Algorithm 1. For the training set, we first note that only the utterance MAP-adapted means for the vectors m_i are used. A typical data set for training the nuisance subspace would have several sessions per speaker, typically 8 or more. A second comment on Algorithm 1 is that the metric, D, used for training the subspace is not utterance dependent. In practice, this has not impacted performance. Third, we mention that one good choice for W_i is the number of speech frames detected by speech activity detection. Fourth, we mention that in the algorithm, kernel PCA can be used as an alternative to the direct parameter expansion [7].

Algorithm 1: WNAP subspace training algorithm for a fixed metric, D = (λ_UBM^{1/2} ⊗ I_n) Σ^{-1/2}
Input: mean parameter vectors {m_i}, weights {W_i}, with speaker labels {l_i}, and the desired corank
Output: nuisance subspace, U
    for all unique speakers s in {l_i} do
        Find the smoothed vector m̄_s for speaker s
        for all j in {j | l_j = s} do
            Let d_j = m_j - m̄_s
        end for
    end for
    R = 0
    for i = 1 to N do
        R = R + W_i d_i d_i^t
    end for
    R̂ = D R D
    Û = eigs(R̂, corank)    % eigs produces the eigenvectors of the largest magnitude eigenvalues
    U = D^{-1} Û
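The following Python sketch mirrors Algorithm 1 for the constant-metric case. Two caveats: the smoothed speaker vector is taken here as the average of that speaker's utterance vectors (an assumption; the text describes adapting a model from the pooled data), and scipy's eigsh stands in for the eigs call. At the supervector dimensions used in Section 6 one would avoid forming R densely; this is for illustration only.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def train_wnap_subspace(mean_vectors, weights, speaker_labels, D, corank):
    """mean_vectors: (M, N*n) MAP-adapted mean supervectors m_i
    weights:        (M,) utterance weights W_i (e.g., number of speech frames)
    speaker_labels: length-M array of labels l_i
    D:              (N*n,) diagonal of the fixed metric; corank: subspace dimension."""
    mean_vectors = np.asarray(mean_vectors, dtype=float)
    speaker_labels = np.asarray(speaker_labels)
    dim = mean_vectors.shape[1]
    R = np.zeros((dim, dim))
    for spk in np.unique(speaker_labels):
        idx = np.where(speaker_labels == spk)[0]
        m_bar = mean_vectors[idx].mean(axis=0)       # smoothed per-speaker vector (assumption)
        for j in idx:
            d_j = mean_vectors[j] - m_bar            # nuisance delta
            R += weights[j] * np.outer(d_j, d_j)
    R_hat = (D[:, None] * R) * D[None, :]            # R_hat = D R D
    _, U_hat = eigsh(R_hat, k=corank, which='LM')    # top-corank eigenvectors
    return U_hat / D[:, None]                        # U = D^{-1} U_hat
```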
In Algorithm 2, we show compensation using WNAP and speaker comparison scoring using C_GM. Note that the matrix D is the same as in Algorithm 1. We also mention that the relevance factor for MAP adaptation can be tuned. Typically, we use a relevance factor of 0.01.

Algorithm 2: Compensation and scoring with the C_GM kernel
Input: two sequences of feature vectors, X_1 and X_2
Output: comparison score, s
    for i = 1 to 2 do
        m_i = mean parameters of the MAP-adapted UBM for X_i
        D_i = (λ_i^{1/2} ⊗ I_n) Σ^{-1/2}
        Update m_i ← m_i - P_{U,D} m_i
    end for
    s = m_1^t D_1 D_2 m_2
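A matching numpy sketch of Algorithm 2, under the reading that the nuisance projection uses the fixed training metric D while the per-utterance D_i enters only in the final inner product; diagonal UBM covariances and the helper names are assumptions.

```python
import numpy as np

def utterance_metric(lam, inv_sqrt_cov):
    """Diagonal of D_i = (lambda_i^{1/2} (x) I_n) Sigma^{-1/2} as a vector."""
    return (np.sqrt(lam)[:, None] * inv_sqrt_cov).ravel()

def remove_nuisance(m, D, U):
    """m <- m - P_{U,D} m, with P_{U,D} from (11) and D a diagonal vector."""
    D2 = D ** 2
    G = U.T @ (D2[:, None] * U)
    return m - U @ np.linalg.solve(G, U.T @ (D2 * m))

def compare(p_x, p_y, ubm_means, inv_sqrt_cov, D_train, U, N, n):
    """Comparison score s = m_1^t D_1 D_2 m_2 after WNAP compensation."""
    scored = []
    for p in (p_x, p_y):
        lam, means = p[:N], p[N:].reshape(N, n)
        m = (means - ubm_means).ravel()            # UBM-centered mean supervector
        m = remove_nuisance(m, D_train, U)         # compensation with the training metric
        D_i = utterance_metric(lam, inv_sqrt_cov)  # utterance-dependent metric
        scored.append(D_i * m)                     # D_i m_i
    return float(scored[0].dot(scored[1]))
```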
6. Experiments

6.1. Setup

Experiments were performed on the NIST 2008 speaker recognition evaluation (SRE) data set. Enrollment/verification methodology and the evaluation criteria, equal error rate (EER) and minDCF, were based on the NIST SRE evaluation plan [8]. The main focus of our effort was the one conversation enroll, one conversation verification task for telephone channel speech. T-Norm models and Z-Norm speech utterances were drawn from the NIST 2004 SRE corpus. Results were obtained both for the English-only trials (Eng, pool 7) and for all trials (All, pool 6), which include speakers that enroll/verify in different languages.

Feature extraction was performed using HTK [9] with 20 MFCC coefficients, deltas, and acceleration coefficients for a total of 60 features. Speech activity detection (SAD) was performed using a cascade of two systems. First, a GMM speech/non-speech detector was applied. Then, these SAD marks were post-processed with an energy-based detector. Features from non-speech frames were eliminated, and then feature warping [10] was applied to all of the resulting features with a 3 second window.

A GMM UBM with 512 mixture components was trained using data from NIST SRE 2004 and from Switchboard corpora. A nuisance subspace was trained using the speakers from Switchboard 2 and NIST 2004 SRE corpora using Algorithm 1. The dimension of the nuisance subspace, U, was fixed at 64.

A few aspects of the front end were critical for the best performance. First, the full bandwidth MFCC analysis, 0-4 kHz, performed the best. In our experiments, we found that WNAP could take advantage of the additional bandwidth for speaker comparison. Second, our cascaded SAD is fairly aggressive. We found that low-level speech was not helpful in discrimination and could contain cross-talk. Finally, feature warping was a slight gain over feature 0-1 mean and variance normalization.

For our SVM system, we used both the mean-only KL kernel, K_KL, described in [2] and the new C_GM kernel from (9). The SVM background was constructed from Fisher data. T-Norm and Z-Norm were performed in the same manner as for the C_GM system. A relevance factor of 4 was used for the K_KL kernel to match prior work and for best performance. For the C_GM kernel, a relevance factor of 0.01 was used in both the kernel-only and SVM experiments.
6.2. Results

The first two lines of Table 1 show baseline systems and their compute time from benchmarks. Computation is not shown for fusion systems (MIT LL and MFCC+LPCC) because we are focusing on single system performance.

The next three lines of the table contrast the new C_GM kernel with the K_KL kernel from earlier work. In the table, we see that the new C_GM kernel (shown with ZT-Norm) outperforms the K_KL kernel in all tasks. Note that C_GM incorporates the utterance mixture weights as a way of discounting uncertain mixture components in the inner product; thus, C_GM allows a lower relevance factor to be used. The prior kernel, K_KL, is more sensitive to a "noisy" model that has mixture components that are uncertain and requires a higher relevance factor.

The next set of experiments used the kernels in an SVM configuration. The results demonstrate that SVM training does not improve the performance of the speaker comparison system. This result might be expected, since having only a single vector for the target speaker gives the SVM little additional information to exploit.

We also performed simple calibration and linear fusion experiments with our system. First, we experimented with different score normalizations, Z-Norm, T-Norm, etc. Second, we demonstrated the effect of not using English/non-English calibration (w/o Calibration) [11]. Third, we fused the MFCC system score linearly with an LPCC-based system. The LPCC front end was a minor change of the HTK configuration: 18 cepstral coefficients with energy were used along with deltas and acceleration for a total of 57 features. The resulting system also performed well and fused with our base MFCC system to achieve substantially better performance. Fusion of multiple feature types with the same system has improved performance in many systems [11, 13].
Table 1: A comparison of different systems on the NIST SRE 2008, one conversation telephone train and test subset (pools 6 and 7). Compute time is normalized to a JFA baseline and includes compensation and inner product only. Best performing systems are shown in bold for reference.

System                    | EER All (%) | minDCF All (x100) | EER Eng (%) | minDCF Eng (x100) | Compute time
BUT MFCC 20 System [11]   | 5.71        | 2.95              | 2.85        | 1.40              | 1.00
MIT LL Fused System [12]  | 7.00        | 3.60              | 3.30        | 1.60              | -
K_KL, rf = 0.01           | 7.22        | 3.39              | 4.40        | 2.04              | 0.08
K_KL, rf = 4              | 6.11        | 3.04              | 3.34        | 1.68              | 0.08
C_GM                      | 5.86        | 2.89              | 3.09        | 1.57              | 0.08
SVM K_KL                  | 6.51        | 3.01              | 3.51        | 1.65              | 0.86
SVM C_GM                  | 6.31        | 2.95              | 3.64        | 1.57              | 0.86
C_GM w/o ZT-Norm          | 6.83        | 3.58              | 3.75        | 1.98              | 0.08
C_GM Z-Norm Only          | 6.39        | 3.12              | 3.33        | 1.64              | 0.08
C_GM T-Norm Only          | 6.72        | 3.31              | 3.63        | 1.70              | 0.08
C_GM w/o Calibration      | 6.70        | 3.68              | 3.09        | 1.57              | 0.08
C_GM LPCC                 | 6.18        | 2.85              | 2.99        | 1.59              | 0.08
Fuse LPCC+MFCC            | 5.27        | 2.58              | 2.91        | 1.36              | -
6.3. Analysis

We note that our best performing system, C_GM with ZT-Norm, performs well in comparison to other systems in the literature [11] and [12]. A key point of our current work is that our computation and implementation are simplified with respect to previous methods.

Our new C_GM system with WNAP reduces complexity over older systems with similar performance. First, we have shown that the SVM training in [12] is not necessary for the speaker comparison task. Second, our system does not require the use of joint factor analysis (JFA) [3]. JFA requires considerably more resources in both corpora and computation. For the speaker subspace in JFA, a large corpus is needed to model interspeaker variation. For computation, the JFA system requires the solution of both speaker and channel factors. Compensation and scoring with JFA can be an order of magnitude slower [4] in our benchmarks. In both the SVM and JFA cases, further research is needed to understand performance in tasks where more (or less) speaker data is available.
7. Conclusions
A new kernel for speaker comparison based upon an approximate KL divergence was presented. We showed that subspace-based channel compensation could be trained and implemented with a simple algorithm, WNAP. An analysis of various configurations of the system demonstrated that simple approximate KL scoring and WNAP produced excellent performance in comparison to SVM and JFA systems. Several methods for achieving state-of-the-art performance were presented.
8. References
[1] W. M. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in ICASSP, 2002, pp. 161–164.
[2] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in ICASSP, 2006, pp. I-97–I-100.
[3] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A study of inter-speaker variability in speaker verification," IEEE Transactions on Audio, Speech and Language Processing, 2008.
[4] W. M. Campbell, Z. Karam, and D. E. Sturim, "Speaker comparison with inner product discriminant functions," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, pp. 207–215.
[5] W. M. Campbell, "Weighted nuisance attribute projection," submitted to Proc. Odyssey 2010: The Speaker and Language Recognition Workshop, 2010.
[6] Douglas A. Reynolds, T. F. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
[7] Bernhard Schölkopf, Alex J. Smola, and Klaus-Robert Müller, "Kernel principal component analysis," in Advances in Kernel Methods, Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, Eds., pp. 327–352. MIT Press, Cambridge, Massachusetts, 1999.
[8] M. A. Przybocki, A. F. Martin, and A. N. Le, "NIST speaker recognition evaluations utilizing the Mixer corpora – 2004, 2005, 2006," IEEE Trans. on Speech, Audio, Lang., vol. 15, no. 7, pp. 1951–1959, 2007.
[9] J. Odell, D. Ollason, P. Woodland, S. Young, and J. Jansen, The HTK Book for HTK V2.0, Cambridge University Press, Cambridge, UK, 1995.
[10] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. of Speaker Odyssey Workshop, 2001, pp. 213–218.
[11] L. Burget, V. Hubeika, O. Glembek, M. Karafiat, M. Kockmann, P. Matejka, Petr Schwarz, and J. Cernocky, "BUT system for NIST 2008 speaker recognition evaluation," in Proc. Interspeech, 2009, pp. 2335–2338.
[12] D. Sturim, W. M. Campbell, Z. Karam, D. A. Reynolds, and F. Richardson, "The MIT Lincoln Laboratory 2008 speaker recognition system," in Proc. Interspeech, 2009, pp. 2359–2362.
[13] W. M. Campbell, D. E. Sturim, W. Shen, D. A. Reynolds, and J. Navrátil, "The MIT-LL/IBM 2006 speaker recognition system: High-performance reduced-complexity recognition," in ICASSP, 2007, pp. IV-217–IV-220.