Static Prediction Games for Adversarial Learning Problems

Journal of Machine Learning Research 13 (2012) 2617-2654
Submitted 8/11; Revised 5/12; Published 9/12
Michael Brückner (MIBRUECK@CS.UNI-POTSDAM.DE)
Department of Computer Science, University of Potsdam, August-Bebel-Str. 89, 14482 Potsdam, Germany

Christian Kanzow (KANZOW@MATHEMATIK.UNI-WUERZBURG.DE)
Institute of Mathematics, University of Würzburg, Emil-Fischer-Str. 30, 97074 Würzburg, Germany

Tobias Scheffer (SCHEFFER@CS.UNI-POTSDAM.DE)
Department of Computer Science, University of Potsdam, August-Bebel-Str. 89, 14482 Potsdam, Germany

Editor: Nicolò Cesa-Bianchi
Abstract

The standard assumption of identically distributed training and test data is violated when the data are generated in response to the presence of a predictive model. This becomes apparent, for example, in the context of email spam filtering. Here, email service providers employ spam filters, and spam senders engineer campaign templates to achieve a high rate of successful deliveries despite the filters. We model the interaction between the learner and the data generator as a static game in which the cost functions of the learner and the data generator are not necessarily antagonistic. We identify conditions under which this prediction game has a unique Nash equilibrium and derive algorithms that find the equilibrial prediction model. We derive two instances, the Nash logistic regression and the Nash support vector machine, and empirically explore their properties in a case study on email spam filtering.

Keywords: static prediction games, adversarial classification, Nash equilibrium
1. Intr oduction
A common assumption on which most learning algorithms are based is that training and test data are governed by identical distributions. However, in a variety of applications, the distribution that governs data at application time may be influenced by an adversary whose interests are in conflict with those of the learner. Consider, for instance, the following three scenarios. In computer and network security, scripts that control attacks are engineered with botnet and intrusion detection systems in mind. Credit card defrauders adapt their unauthorized use of credit cards (in particular, the amounts charged per transaction and per day, and the types of businesses that amounts are charged from) to avoid triggering the alerting mechanisms employed by credit card companies. Email spam senders design message templates that are instantiated by nodes of botnets. These templates are
© 2012 Michael Brückner, Christian Kanzow and Tobias Scheffer.
specifically designed to produce a low spam score with popular spam filters. The domain of email spam filtering will serve as a running example throughout the paper. In all of these applications, the party that creates the predictive model and the adversarial party that generates future data are aware of each other, and factor the possible actions of their opponent into their decisions.
The interaction between learner and data generator can be modeled as a game in which one player controls the predictive model whereas the other exercises some control over the process of data generation. The adversary's influence on the generation of the data can be formally modeled as a transformation that is imposed on the distribution that governs the data at training time. The transformed distribution then governs the data at application time. The optimization criterion of either player takes as arguments both the predictive model chosen by the learner and the transformation carried out by the adversary.
Typically, this problem is modeled under the worst-case assumption that the adversary desires to impose the highest possible costs on the learner. This amounts to a zero-sum game in which the loss of one player is the gain of the other. In this setting, both players can maximize their expected outcome by following a minimax strategy. Lanckriet et al. (2002) study the minimax probability machine (MPM). This classifier minimizes the maximal probability of misclassifying new instances for a given mean and covariance matrix of each class. Geometrically, these class means and covariances define two hyper-ellipsoids which are equally scaled such that they intersect; their common tangent is the minimax probabilistic decision hyperplane. Ghaoui et al. (2003) derive a minimax model for input data that are known to lie within some hyper-rectangles around the training instances. Their solution minimizes the worst-case loss over all possible choices of the data in these intervals. Similarly, worst-case solutions to classification games in which the adversary deletes input features (Globerson and Roweis, 2006; Globerson et al., 2009) or performs an arbitrary feature transformation (Teo et al., 2007; Dekel and Shamir, 2008; Dekel et al., 2010) have been studied.
Several applications motivate problem settings in which the goals of the learner and the data generator, while still conflicting, are not necessarily entirely antagonistic. For instance, a defrauder's goal of maximizing the profit made from exploiting phished account information is not the inverse of an email service provider's goal of achieving a high spam recognition rate at close-to-zero false positives. When playing a minimax strategy, one often makes overly pessimistic assumptions about the adversary's behavior and may not necessarily obtain an optimal outcome.
Games in which a leader, typically the learner, commits to an action first, whereas the adversary can react after the leader's action has been disclosed, are naturally modeled as a Stackelberg competition. This model is appropriate when the follower, the data generator, has full information about the predictive model. This assumption is usually a pessimistic approximation of reality because, for instance, neither email service providers nor credit card companies disclose a comprehensive documentation of their current security measures. Stackelberg equilibria of adversarial classification problems can be identified by solving a bilevel optimization problem (Brückner and Scheffer, 2011).
This paper studies static prediction games in which both players act simultaneously; that is, without prior information on their opponent's move. When the optimization criterion of both players depends not only on their own action but also on their opponent's move, then the concept of a player's optimal action is no longer well-defined. Therefore, we resort to the concept of a Nash equilibrium of static prediction games. A Nash equilibrium is a pair of actions chosen such that no player benefits from unilaterally selecting a different action. If a game has a unique Nash equilibrium and is played by rational players that aim at optimizing their criteria, it is reasonable for each player to assume that the opponent will play according to the Nash equilibrium strategy. If one player plays according to the equilibrium strategy, the optimal move for the other player is to play this equilibrium strategy as well. If, however, multiple equilibria exist and the players choose their strategies according to distinct ones, then the resulting combination may be arbitrarily disadvantageous for either player. It is therefore interesting to study whether adversarial prediction games have a unique Nash equilibrium.
Our work builds on an approach that Brückner and Scheffer (2009) developed for finding a Nash equilibrium of a static prediction game. We will discuss a flaw in Theorem 1 of Brückner and Scheffer (2009) and develop a revised version of the theorem that identifies conditions under which a unique Nash equilibrium of a prediction game exists. In addition to the inexact linesearch approach to finding the equilibrium that Brückner and Scheffer (2009) develop, we will follow a modified extragradient approach and develop the Nash logistic regression and the Nash support vector machine. This paper also develops a kernelized version of these methods. An extended empirical evaluation explores the applicability of the Nash instances in the context of email spam filtering. We empirically verify the assumptions made in the modeling process and compare the performance of the Nash instances with baseline methods on several email corpora, including a corpus from an email service provider.
The rest of this paper is organized as follows. Section 2 introduces the problem setting. We formalize the Nash prediction game and study conditions under which a unique Nash equilibrium exists in Section 3. Section 4 develops strategies for identifying equilibrial prediction models, and in Section 5, we detail two instances of the Nash prediction game. In Section 6, we report on experiments on email spam filtering; Section 7 concludes.
2. Problem Setting
We study static prediction games between two players: the learner ($v = -1$) and an adversary, the data generator ($v = +1$). In our running example of email spam filtering, we study the competition between recipient and senders, not competition among senders. Therefore, $v = -1$ refers to the recipient, whereas $v = +1$ models the entirety of all legitimate and abusive email senders as a single, amalgamated player.
At training time, the data generator $v = +1$ produces a sample $D = \{(x_i, y_i)\}_{i=1}^n$ of $n$ training instances $x_i \in X$ with corresponding class labels $y_i \in Y = \{-1, +1\}$. These object-class pairs are drawn according to a training distribution with density function $p(x, y)$. By contrast, at application time the data generator produces object-class pairs according to some test distribution with density $\dot{p}(x, y)$, which may differ from $p(x, y)$.
The task of the learner $v = -1$ is to select the parameters $w \in W \subset \mathbb{R}^m$ of a predictive model $h(x) = \operatorname{sign} f_w(x)$ implemented in terms of a generalized linear decision function $f_w : X \to \mathbb{R}$ with $f_w(x) = w^T \phi(x)$ and feature mapping $\phi : X \to \mathbb{R}^m$. The learner's theoretical costs at application time are given by

$$\theta_{-1}(w, \dot{p}) = \sum_{y \in Y} \int_{X} c_{-1}(x, y)\, \ell_{-1}(f_w(x), y)\, \dot{p}(x, y)\, dx,$$
where weighting function $c_{-1} : X \times Y \to \mathbb{R}$ and loss function $\ell_{-1} : \mathbb{R} \times Y \to \mathbb{R}$ compose the weighted loss $c_{-1}(x, y)\, \ell_{-1}(f_w(x), y)$ that the learner incurs when the predictive model classifies instance $x$ as $h(x) = \operatorname{sign} f_w(x)$ while the true label is $y$. The positive class- and instance-specific weighting factors $c_{-1}(x, y)$ with $\mathbb{E}_{X,Y}[c_{-1}(x, y)] = 1$ specify the importance of minimizing the loss $\ell_{-1}(f_w(x), y)$ for the corresponding object-class pair $(x, y)$. For instance, in spam filtering, the correct classification of non-spam messages can be business-critical for email service providers, while failing to detect spam messages runs up processing and storage costs, depending on the size of the message.
The data generator $v = +1$ can modify the data generation process at application time. In practice, spam senders update their campaign templates, which are disseminated to the nodes of botnets. Formally, the data generator transforms the training distribution with density $p$ into the test distribution with density $\dot{p}$ by modifying the data generation process. The data generator incurs transformation costs, which are quantified by $\Omega_{+1}(p, \dot{p})$. This term acts as a regularizer on the transformation and may implicitly constrain the possible difference between the distributions at training and application time, depending on the nature of the application that is to be modeled. For instance, the email sender may not be allowed to alter the training distribution for non-spam messages, or to modify the nature of the messages by changing the label from spam to non-spam or vice versa. Additionally, changing the training distribution for spam messages may incur costs depending on the extent of distortion inflicted on the informational payload. The theoretical costs of the data generator at application time are the sum of the expected prediction costs and the transformation costs,
$$\theta_{+1}(w, \dot{p}) = \sum_{y \in Y} \int_{X} c_{+1}(x, y)\, \ell_{+1}(f_w(x), y)\, \dot{p}(x, y)\, dx + \Omega_{+1}(p, \dot{p}),$$
where, in analogy to the learner's costs, $c_{+1}(x, y)\, \ell_{+1}(f_w(x), y)$ quantifies the weighted loss that the data generator incurs when instance $x$ is labeled as $h(x) = \operatorname{sign} f_w(x)$ while the true label is $y$. The weighting factors $c_{+1}(x, y)$ with $\mathbb{E}_{X,Y}[c_{+1}(x, y)] = 1$ express the significance of $(x, y)$ from the perspective of the data generator. In our example scenario, this reflects that the costs of correctly or incorrectly classified instances may vary greatly across the different physical senders that are aggregated into the amalgamated player.
Since the theoretical costs of both players depend on the test distribution, they can, for all practical purposes, not be calculated. Hence, we focus on a regularized, empirical counterpart of the theoretical costs based on the training sample $D$. The empirical counterpart $\hat{\Omega}_{+1}(D, \dot{D})$ of the data generator's regularizer $\Omega_{+1}(p, \dot{p})$ penalizes the divergence between training sample $D = \{(x_i, y_i)\}_{i=1}^n$ and a perturbed training sample $\dot{D} = \{(\dot{x}_i, y_i)\}_{i=1}^n$ that would be the outcome of applying the transformation that translates $p$ into $\dot{p}$ to sample $D$. The learner's cost function, instead of integrating over $\dot{p}$, sums over the elements of the perturbed training sample $\dot{D}$. The players' empirical cost functions can still only be evaluated after the learner has committed to parameters $w$ and the data generator to a transformation. However, this transformation needs only be represented in terms of the effects that it will have on the training sample $D$. The transformed training sample $\dot{D}$ must not be mistaken for test data; test data are generated under $\dot{p}$ at application time after the players have committed to their actions.
The empirical costs incurred by the predictive model $h(x) = \operatorname{sign} f_w(x)$ with parameters $w$ and the shift from $p$ to $\dot{p}$ amount to

$$\hat{\theta}_{-1}(w, \dot{D}) = \sum_{i=1}^{n} c_{-1,i}\, \ell_{-1}(f_w(\dot{x}_i), y_i) + \rho_{-1} \hat{\Omega}_{-1}(w), \qquad (1)$$

$$\hat{\theta}_{+1}(w, \dot{D}) = \sum_{i=1}^{n} c_{+1,i}\, \ell_{+1}(f_w(\dot{x}_i), y_i) + \rho_{+1} \hat{\Omega}_{+1}(D, \dot{D}), \qquad (2)$$
where we have replaced the weighting terms $\frac{1}{n} c_v(\dot{x}_i, y_i)$ by constant cost factors $c_{v,i} > 0$ with $\sum_i c_{v,i} = 1$. The learner's regularizer $\hat{\Omega}_{-1}(w)$ in (1) accounts for the fact that $\dot{D}$ does not constitute the test data itself, but is merely a training sample transformed to reflect the test distribution and then used to learn the model parameters $w$. The trade-off between the empirical loss and the regularizer is controlled by each player's regularization parameter $\rho_v > 0$ for $v \in \{-1, +1\}$.
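For concreteness, the empirical cost functions (1) and (2) can be implemented in a few lines. The following sketch assumes logistic losses for both players, the regularizers $\hat{\Omega}_{-1}(w) = \frac{1}{2}\|w\|^2$ and $\hat{\Omega}_{+1}(D, \dot{D}) = \frac{1}{2}\sum_i \|\dot{x}_i - x_i\|^2$, uniform cost factors $c_{v,i} = \frac{1}{n}$, and the identity feature map; these are illustrative choices, not the instances derived later in the paper.

```python
import numpy as np

def logistic_loss(z, y):
    # illustrative loss; convex and twice differentiable in z
    return np.log1p(np.exp(-y * z))

def empirical_costs(w, X_dot, y, X, rho_minus=1.0, rho_plus=1.0):
    """Empirical costs (1) and (2) under the stated illustrative assumptions."""
    n = len(y)
    c = np.full(n, 1.0 / n)                   # constant cost factors, summing to 1
    f = X_dot @ w                             # decision values f_w on perturbed instances
    loss = np.sum(c * logistic_loss(f, y))    # both players share the loss in this sketch
    theta_minus = loss + rho_minus * 0.5 * (w @ w)                   # Equation (1)
    theta_plus = loss + rho_plus * 0.5 * np.sum((X_dot - X) ** 2)    # Equation (2)
    return theta_minus, theta_plus

# with w = 0 and an unperturbed sample, both costs reduce to the plain loss log(2)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
tm, tp = empirical_costs(np.zeros(2), X.copy(), y, X)
assert abs(tm - np.log(2)) < 1e-12 and abs(tp - np.log(2)) < 1e-12
```

In general the two players use different losses $\ell_{-1}$ and $\ell_{+1}$ and a more refined transformation regularizer; the structure of the computation, however, remains the same.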
Note that either player's empirical costs $\hat{\theta}_v$ depend on both players' actions: $w \in W$ and $\dot{D} \subseteq X \times Y$. Because of the players' potentially conflicting interests, the decision process for $w$ and $\dot{D}$ becomes a non-cooperative two-player game, which we call a prediction game. In the following section, we will refer to the Nash prediction game (NPG), which identifies the concept of an optimal move of the learner and the data generator under the assumption of simultaneously acting players.
3. The Nash Prediction Game
The outcome of a prediction game is one particular combination of actions $(w^*, \dot{D}^*)$ that incurs costs $\hat{\theta}_v(w^*, \dot{D}^*)$ for the players. Each player is aware that this outcome is affected by both players' actions and that, consequently, their potential to choose an action can have an impact on the other player's decision. In general, there is no action that minimizes one player's cost function independently of the other player's action. In a non-cooperative game, the players are not allowed to communicate while making their decisions and therefore have no information about the other player's strategy. In this setting, any concept of an optimal move requires additional assumptions on how the adversary will act.
We model the decision process for $w^*$ and $\dot{D}^*$ as a static two-player game with complete information. In a static game, both players commit to an action simultaneously, without information about their opponent's action. In a game with complete information, both players know their opponent's cost function and action space.
When $\hat{\theta}_{-1}$ and $\hat{\theta}_{+1}$ are known and antagonistic, the assumption that the adversary will seek the greatest advantage by inflicting the greatest damage on $\hat{\theta}_{-1}$ justifies the minimax strategy: $\operatorname{argmin}_{w} \max_{\dot{D}} \hat{\theta}_{-1}(w, \dot{D})$. However, when the players' cost functions are not antagonistic, assuming that the adversary will inflict the greatest possible damage is overly pessimistic. Instead, assuming that the adversary acts rationally in the sense of seeking the greatest possible personal advantage leads to the concept of a Nash equilibrium. An equilibrium strategy is a steady state of the game in which neither player has an incentive to unilaterally change their plan of actions.
In static games, equilibrium strategies are called Nash equilibria, which is why we refer to the resulting predictive model as the Nash prediction game (NPG). In a two-player game, a Nash equilibrium is defined as a pair of actions such that no player can benefit from changing their action unilaterally; that is,

$$\hat{\theta}_{-1}(w^*, \dot{D}^*) = \min_{w \in W} \hat{\theta}_{-1}(w, \dot{D}^*),$$

$$\hat{\theta}_{+1}(w^*, \dot{D}^*) = \min_{\dot{D} \subseteq X \times Y} \hat{\theta}_{+1}(w^*, \dot{D}),$$

where $W$ and $X \times Y$ denote the players' action spaces.
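The defining property of an equilibrium is easy to exercise on a toy game. The following sketch uses scalar actions and strictly convex quadratic costs (stand-ins for illustration, not the prediction game's cost functions); each best response is available in closed form, best-response iteration converges to the unique equilibrium, and at the fixed point no unilateral deviation pays off.

```python
# Toy two-player game with strictly convex quadratic costs (illustrative only).
# theta1(u, v) = (u - 0.5*v - 1)^2   -> best response u*(v) = 0.5*v + 1
# theta2(u, v) = (v - 0.25*u - 1)^2  -> best response v*(u) = 0.25*u + 1
def br1(v): return 0.5 * v + 1.0
def br2(u): return 0.25 * u + 1.0

u, v = 0.0, 0.0
for _ in range(100):                  # best-response iteration (a contraction here)
    u, v = br1(v), br2(u)

# the unique Nash equilibrium solves u = 0.5*v + 1 and v = 0.25*u + 1
assert abs(u - 12 / 7) < 1e-9 and abs(v - 10 / 7) < 1e-9

# neither player benefits from a unilateral deviation
theta1 = lambda a, b: (a - 0.5 * b - 1) ** 2
theta2 = lambda a, b: (b - 0.25 * a - 1) ** 2
for d in (-0.1, 0.1):
    assert theta1(u + d, v) > theta1(u, v)
    assert theta2(u, v + d) > theta2(u, v)
```

Best-response iteration is not guaranteed to converge for the prediction game itself; the algorithms of Section 4 are designed for that purpose.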
However, a static prediction game may not have a Nash equilibrium, or it may possess multiple equilibria. If $(w^*, \dot{D}^*)$ and $(w', \dot{D}')$ are distinct Nash equilibria and each player decides to act according to a different one of them, then the combinations $(w^*, \dot{D}')$ and $(w', \dot{D}^*)$ may incur arbitrarily high costs for both players. Hence, one can argue that it is rational for an adversary to play a Nash equilibrium only when the following assumption is satisfied.
Assumption 1  The following statements hold:

1. both players act simultaneously;

2. both players have full knowledge about both (empirical) cost functions $\hat{\theta}_v(w, \dot{D})$ defined in (1) and (2), and both action spaces $W$ and $X \times Y$;

3. both players act rationally with respect to their cost function in the sense of securing their lowest possible costs;

4. a unique Nash equilibrium exists.
Whether Assumptions 1.1-1.3 are adequate (especially the assumption of simultaneous actions) strongly depends on the application. For example, in some applications, the data generator may unilaterally be able to acquire information about the model $f_w$ before committing to $\dot{D}$. Such situations are better modeled as a Stackelberg competition (Brückner and Scheffer, 2011). On the other hand, when the learner is able to treat any executed action as part of the training data $D$ and update the model $w$, the setting is better modeled as repeated executions of a static game with simultaneous actions. The adequateness of Assumption 1.4, which we discuss in the following sections, depends on the chosen loss functions, the cost factors, and the regularizers.
3.1 Existence of a Nash Equilibrium
Theorem 1 of Brückner and Scheffer (2009) identifies conditions under which a unique Nash equilibrium exists. Kanzow located a flaw in the proof of this theorem: The proof argues that the pseudo-Jacobian can be decomposed into two (strictly) positive stable matrices by showing that the real part of every eigenvalue of those two matrices is positive. However, this does not generally imply that the sum of these matrices is positive stable as well, since this would require a common Lyapunov solution (cf. Problem 2.2.6 of Horn and Johnson, 1991). But even if such a solution exists, positive definiteness cannot be concluded from the positiveness of all eigenvalues, as the pseudo-Jacobian is generally non-symmetric.
Having "unproven" prior claims, we will now derive sufficient conditions for the existence of a Nash equilibrium. To this end, we first define

$$x := \left[ \phi(x_1)^T, \phi(x_2)^T, \ldots, \phi(x_n)^T \right]^T \in \phi(X)^n \subset \mathbb{R}^{m \cdot n},$$

$$\dot{x} := \left[ \phi(\dot{x}_1)^T, \phi(\dot{x}_2)^T, \ldots, \phi(\dot{x}_n)^T \right]^T \in \phi(X)^n \subset \mathbb{R}^{m \cdot n},$$

as long, concatenated column vectors induced by feature mapping $\phi$, training sample $D = \{(x_i, y_i)\}_{i=1}^n$, and transformed training sample $\dot{D} = \{(\dot{x}_i, y_i)\}_{i=1}^n$, respectively. For terminological harmony, we refer to vector $\dot{x}$ as the data generator's action with corresponding action space $\phi(X)^n$.
We make the following assumptions on the action spaces and the cost functions, which enable us to state the main result on the existence of at least one Nash equilibrium in Lemma 1.
Assumption 2  The players' cost functions defined in Equations 1 and 2, and their action sets $W$ and $\phi(X)^n$, satisfy the following properties:

1. loss functions $\ell_v(z, y)$ with $v \in \{-1, +1\}$ are convex and twice continuously differentiable with respect to $z \in \mathbb{R}$ for all fixed $y \in Y$;

2. regularizers $\hat{\Omega}_v$ are uniformly strongly convex and twice continuously differentiable with respect to $w \in W$ and $\dot{x} \in \phi(X)^n$, respectively;

3. action spaces $W$ and $\phi(X)^n$ are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces $\mathbb{R}^m$ and $\mathbb{R}^{m \cdot n}$, respectively.
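As an example, the logistic loss $\ell_v(z, y) = \log(1 + \exp(-yz))$ satisfies property 1: its second derivative with respect to $z$ is $\sigma(yz)\,\sigma(-yz) > 0$, where $\sigma$ denotes the logistic function. A quick numerical check against central finite differences:

```python
import numpy as np

def loss(z, y):      # logistic loss, an illustrative choice satisfying Assumption 2.1
    return np.log1p(np.exp(-y * z))

def loss_dd(z, y):   # closed-form second derivative: sigma(yz) * sigma(-yz) > 0
    s = 1.0 / (1.0 + np.exp(-y * z))
    return s * (1.0 - s)

h = 1e-4
for y in (-1.0, 1.0):
    for z in np.linspace(-3, 3, 13):
        fd = (loss(z + h, y) - 2 * loss(z, y) + loss(z - h, y)) / h ** 2
        assert abs(fd - loss_dd(z, y)) < 1e-5   # twice differentiable in z
        assert loss_dd(z, y) > 0                # strictly convex in z
```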
Lemma 1  Under Assumption 2, at least one equilibrium point $(w^*, \dot{x}^*) \in W \times \phi(X)^n$ of the Nash prediction game defined by

$$\min_{w} \hat{\theta}_{-1}(w, \dot{x}^*) \;\; \text{s.t.} \;\; w \in W, \qquad \min_{\dot{x}} \hat{\theta}_{+1}(w^*, \dot{x}) \;\; \text{s.t.} \;\; \dot{x} \in \phi(X)^n \qquad (3)$$

exists.
Proof. Each player $v$'s cost function is a sum of loss terms resulting from loss function $\ell_v$ and regularizer $\hat{\Omega}_v$. By Assumption 2, these loss functions are convex and continuous, and the regularizers are uniformly strongly convex and continuous. Hence, both cost functions $\hat{\theta}_{-1}(w, \dot{x})$ and $\hat{\theta}_{+1}(w, \dot{x})$ are continuous in all arguments and uniformly strongly convex in $w \in W$ and $\dot{x} \in \phi(X)^n$, respectively. As both action spaces $W$ and $\phi(X)^n$ are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces, a Nash equilibrium exists; see Theorem 4.3 of Basar and Olsder (1999).
3.2 Uniqueness of the Nash Equilibrium
We will now derive conditions for the uniqueness of an equilibrium of the Nash prediction game defined in (3). We first reformulate the two-player game into an $(n+1)$-player game. In Lemma 2, we then present a sufficient condition for the uniqueness of the Nash equilibrium in this game, and by applying Proposition 4 and Lemmas 5-7 we verify whether this condition is met. Finally, we state the main result in Theorem 8: The Nash equilibrium is unique under certain properties of the loss functions, the regularizers, and the cost factors, all of which can be verified easily.
Taking into account the Cartesian product structure of the data generator's action space $\phi(X)^n$, it is not difficult to see that $(w^*, \dot{x}^*)$ with $\dot{x}^* = [\dot{x}_1^{*T}, \ldots, \dot{x}_n^{*T}]^T$ and $\dot{x}_i^* := \phi(\dot{x}_i^*)$ (the feature image of the $i$-th transformed instance) is a solution of the two-player game if, and only if, $(w^*, \dot{x}_1^*, \ldots, \dot{x}_n^*)$ is a Nash equilibrium of the $(n+1)$-player game defined by
$$\min_{w} \hat{\theta}_{-1}(w, \dot{x}) \;\; \text{s.t.} \;\; w \in W, \qquad \min_{\dot{x}_1} \hat{\theta}_{+1}(w, \dot{x}) \;\; \text{s.t.} \;\; \dot{x}_1 \in \phi(X), \qquad \cdots \qquad \min_{\dot{x}_n} \hat{\theta}_{+1}(w, \dot{x}) \;\; \text{s.t.} \;\; \dot{x}_n \in \phi(X), \qquad (4)$$
which results from (3) by repeating the cost function $\hat{\theta}_{+1}$ $n$ times and minimizing this function with respect to $\dot{x}_i \in \phi(X)$ for $i = 1, \ldots, n$. The pseudo-gradient (in the sense of Rosen, 1965) of the game in (4) is then defined by

$$g_r(w, \dot{x}) := \begin{bmatrix} r_0 \nabla_w \hat{\theta}_{-1}(w, \dot{x}) \\ r_1 \nabla_{\dot{x}_1} \hat{\theta}_{+1}(w, \dot{x}) \\ r_2 \nabla_{\dot{x}_2} \hat{\theta}_{+1}(w, \dot{x}) \\ \vdots \\ r_n \nabla_{\dot{x}_n} \hat{\theta}_{+1}(w, \dot{x}) \end{bmatrix} \in \mathbb{R}^{m + m \cdot n}, \qquad (5)$$
with any fixed vector $r = [r_0, r_1, \ldots, r_n]^T$ where $r_i > 0$ for $i = 0, \ldots, n$. The derivative of $g_r$, that is, the pseudo-Jacobian of (4), is given by

$$J_r(w, \dot{x}) = \Lambda_r \begin{bmatrix} \nabla^2_{w,w} \hat{\theta}_{-1}(w, \dot{x}) & \nabla^2_{w,\dot{x}} \hat{\theta}_{-1}(w, \dot{x}) \\ \nabla^2_{\dot{x},w} \hat{\theta}_{+1}(w, \dot{x}) & \nabla^2_{\dot{x},\dot{x}} \hat{\theta}_{+1}(w, \dot{x}) \end{bmatrix}, \qquad (6)$$

where

$$\Lambda_r := \begin{bmatrix} r_0 I_m & 0 & \cdots & 0 \\ 0 & r_1 I_m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & r_n I_m \end{bmatrix} \in \mathbb{R}^{(m + m \cdot n) \times (m + m \cdot n)}. \qquad (7)$$
Note that the pseudo-gradient $g_r$ and the pseudo-Jacobian $J_r$ exist when Assumption 2 is satisfied. The above definition of the pseudo-Jacobian enables us to state the following result about the uniqueness of a Nash equilibrium.
Lemma 2  Let Assumption 2 hold and suppose there exists a fixed vector $r = [r_0, r_1, \ldots, r_n]^T$ with $r_i > 0$ for all $i = 0, 1, \ldots, n$ such that the corresponding pseudo-Jacobian $J_r(w, \dot{x})$ is positive definite for all $(w, \dot{x}) \in W \times \phi(X)^n$. Then the Nash prediction game in (3) has a unique equilibrium.
Proof. The existence of a Nash equilibrium follows from Lemma 1. Recall from our previous discussion that the original Nash game in (3) has a unique solution if, and only if, the game from (4) with one learner and $n$ data generators admits a unique solution. In view of Theorem 2 of Rosen (1965), the latter attains a unique solution if the pseudo-gradient $g_r$ is strictly monotone; that is, if for all actions $w, w' \in W$ and $\dot{x}, \dot{x}' \in \phi(X)^n$ with $(w, \dot{x}) \neq (w', \dot{x}')$, the inequality

$$\left[ g_r(w, \dot{x}) - g_r(w', \dot{x}') \right]^T \left( \begin{bmatrix} w \\ \dot{x} \end{bmatrix} - \begin{bmatrix} w' \\ \dot{x}' \end{bmatrix} \right) > 0$$

holds. A sufficient condition for this pseudo-gradient being strictly monotone is the positive definiteness of the pseudo-Jacobian $J_r$ (see, e.g., Theorem 7.11 and Theorem 6, respectively, in Geiger and Kanzow, 1999, and Rosen, 1965).
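Positive definiteness is meant here in the non-symmetric sense, $z^T J_r z > 0$ for all $z \neq 0$, which is equivalent to the symmetric part $\frac{1}{2}(J_r + J_r^T)$ having only positive eigenvalues. For an affine map $g(z) = Jz + b$, this property yields exactly the strict monotonicity used in the proof; a small numerical illustration, with a generic matrix standing in for the pseudo-Jacobian:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
# a non-symmetric positive definite matrix: PD symmetric part plus a skew part
S = rng.normal(size=(m, m))
sym = S @ S.T + 0.5 * np.eye(m)       # symmetric positive definite
K = rng.normal(size=(m, m))
skew = 0.5 * (K - K.T)                # contributes nothing to z^T J z
J = sym + skew                        # z^T J z = z^T sym z > 0 although J != J^T

assert np.linalg.eigvalsh(0.5 * (J + J.T)).min() > 0

b = rng.normal(size=m)
g = lambda z: J @ z + b               # affine "pseudo-gradient" with Jacobian J

for _ in range(100):                  # strict monotonicity of g at random point pairs
    z1, z2 = rng.normal(size=m), rng.normal(size=m)
    if not np.allclose(z1, z2):
        assert (g(z1) - g(z2)) @ (z1 - z2) > 0
```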
To verify whether the positive definiteness condition of Lemma 2 is satisfied, we first derive the pseudo-Jacobian $J_r(w, \dot{x})$. We subsequently decompose it into a sum of three matrices and analyze the definiteness of these matrices for the particular choice of vector $r$ with $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}} > 0$ for all $i = 1, \ldots, n$, with corresponding matrix

$$\Lambda_r := \begin{bmatrix} I_m & 0 & \cdots & 0 \\ 0 & \frac{c_{-1,1}}{c_{+1,1}} I_m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{c_{-1,n}}{c_{+1,n}} I_m \end{bmatrix}. \qquad (8)$$

This finally provides us with sufficient conditions which ensure the uniqueness of the Nash equilibrium.
3.2.1 Derivation of the Pseudo-Jacobian
Throughout this section, we denote by $\ell'_v(z, y)$ and $\ell''_v(z, y)$ the first and second derivative of the mapping $\ell_v(z, y)$ with respect to $z \in \mathbb{R}$, and use the abbreviations

$$\ell'_{v,i} := \ell'_v(\dot{x}_i^T w, y_i), \qquad \ell''_{v,i} := \ell''_v(\dot{x}_i^T w, y_i),$$

for both players $v \in \{-1, +1\}$ and $i = 1, \ldots, n$.
To state the pseudo-Jacobian for the empirical costs given in (1) and (2), we first derive their first-order partial derivatives,

$$\nabla_w \hat{\theta}_{-1}(w, \dot{x}) = \sum_{i=1}^{n} c_{-1,i}\, \ell'_{-1,i}\, \dot{x}_i + \rho_{-1} \nabla_w \hat{\Omega}_{-1}(w), \qquad (9)$$

$$\nabla_{\dot{x}_i} \hat{\theta}_{+1}(w, \dot{x}) = c_{+1,i}\, \ell'_{+1,i}\, w + \rho_{+1} \nabla_{\dot{x}_i} \hat{\Omega}_{+1}(x, \dot{x}). \qquad (10)$$
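These first-order derivatives can be checked numerically for an illustrative instantiation: logistic losses for both players, $\hat{\Omega}_{-1}(w) = \frac{1}{2}\|w\|^2$, $\hat{\Omega}_{+1}(x, \dot{x}) = \frac{1}{2}\|\dot{x} - x\|^2$, uniform cost factors, and the identity feature map (all assumptions of this sketch). Central finite differences agree with Equations (9) and (10):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.normal(size=(n, m))               # original instances (phi = identity)
Xd = X + 0.1 * rng.normal(size=(n, m))    # perturbed instances
y = np.array([1.0, -1.0, 1.0, -1.0])
w = rng.normal(size=m)
c = np.full(n, 1.0 / n)                   # uniform cost factors
rho_m, rho_p = 0.5, 0.7

def lprime(z, y):                         # derivative of the logistic loss w.r.t. z
    return -y / (1.0 + np.exp(y * z))

def theta_minus(w, Xd):                   # empirical cost (1)
    z = Xd @ w
    return np.sum(c * np.log1p(np.exp(-y * z))) + rho_m * 0.5 * (w @ w)

def theta_plus(w, Xd):                    # empirical cost (2)
    z = Xd @ w
    return np.sum(c * np.log1p(np.exp(-y * z))) + rho_p * 0.5 * np.sum((Xd - X) ** 2)

# analytic gradients, Equations (9) and (10) (the latter for i = 1)
g_w = (c * lprime(Xd @ w, y)) @ Xd + rho_m * w
g_x1 = c[1] * lprime(Xd[1] @ w, y[1]) * w + rho_p * (Xd[1] - X[1])

eps = 1e-6
fd_w = np.array([(theta_minus(w + eps * e, Xd) - theta_minus(w - eps * e, Xd)) / (2 * eps)
                 for e in np.eye(m)])
assert np.allclose(g_w, fd_w, atol=1e-5)

fd_x1 = np.zeros(m)
for j in range(m):
    Dp, Dm = Xd.copy(), Xd.copy()
    Dp[1, j] += eps
    Dm[1, j] -= eps
    fd_x1[j] = (theta_plus(w, Dp) - theta_plus(w, Dm)) / (2 * eps)
assert np.allclose(g_x1, fd_x1, atol=1e-5)
```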
This allows us to calculate the entries of the pseudo-Jacobian given in (6),

$$\nabla^2_{w,w} \hat{\theta}_{-1}(w, \dot{x}) = \sum_{i=1}^{n} c_{-1,i}\, \ell''_{-1,i}\, \dot{x}_i \dot{x}_i^T + \rho_{-1} \nabla^2_{w,w} \hat{\Omega}_{-1}(w),$$

$$\nabla^2_{w,\dot{x}_i} \hat{\theta}_{-1}(w, \dot{x}) = c_{-1,i}\, \ell''_{-1,i}\, \dot{x}_i w^T + c_{-1,i}\, \ell'_{-1,i}\, I_m,$$

$$\nabla^2_{\dot{x}_i,w} \hat{\theta}_{+1}(w, \dot{x}) = c_{+1,i}\, \ell''_{+1,i}\, w \dot{x}_i^T + c_{+1,i}\, \ell'_{+1,i}\, I_m,$$

$$\nabla^2_{\dot{x}_i,\dot{x}_j} \hat{\theta}_{+1}(w, \dot{x}) = \delta_{ij}\, c_{+1,i}\, \ell''_{+1,i}\, w w^T + \rho_{+1} \nabla^2_{\dot{x}_i,\dot{x}_j} \hat{\Omega}_{+1}(x, \dot{x}),$$

where $\delta_{ij}$ denotes Kronecker's delta, which equals 1 if $i = j$ and 0 otherwise.
We can express these equations more compactly as matrix equations. To this end, we use the matrix $\Lambda_r$ as defined in (7) and set the diagonal matrix $\Gamma_v := \operatorname{diag}(c_{v,1} \ell''_{v,1}, \ldots, c_{v,n} \ell''_{v,n})$. Additionally, we define $\dot{X} \in \mathbb{R}^{n \times m}$ as the matrix with rows $\dot{x}_1^T, \ldots, \dot{x}_n^T$, and $n$ matrices $W_i \in \mathbb{R}^{n \times m}$ with all entries set to zero except for the $i$-th row, which is set to $w^T$. Then,

$$\nabla^2_{w,w} \hat{\theta}_{-1}(w, \dot{x}) = \dot{X}^T \Gamma_{-1} \dot{X} + \rho_{-1} \nabla^2_{w,w} \hat{\Omega}_{-1}(w),$$

$$\nabla^2_{w,\dot{x}_i} \hat{\theta}_{-1}(w, \dot{x}) = \dot{X}^T \Gamma_{-1} W_i + c_{-1,i}\, \ell'_{-1,i}\, I_m,$$

$$\nabla^2_{\dot{x}_i,w} \hat{\theta}_{+1}(w, \dot{x}) = W_i^T \Gamma_{+1} \dot{X} + c_{+1,i}\, \ell'_{+1,i}\, I_m,$$

$$\nabla^2_{\dot{x}_i,\dot{x}_j} \hat{\theta}_{+1}(w, \dot{x}) = W_i^T \Gamma_{+1} W_j + \rho_{+1} \nabla^2_{\dot{x}_i,\dot{x}_j} \hat{\Omega}_{+1}(x, \dot{x}).$$
Hence, the pseudo-Jacobian in (6) can be stated as follows,

$$J_r(w, \dot{x}) = \Lambda_r \begin{bmatrix} \dot{X} & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T \begin{bmatrix} \Gamma_{-1} & \Gamma_{-1} \\ \Gamma_{+1} & \Gamma_{+1} \end{bmatrix} \begin{bmatrix} \dot{X} & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix} + \Lambda_r \begin{bmatrix} \rho_{-1} \nabla^2_{w,w} \hat{\Omega}_{-1}(w) & c_{-1,1}\ell'_{-1,1} I_m & \cdots & c_{-1,n}\ell'_{-1,n} I_m \\ c_{+1,1}\ell'_{+1,1} I_m & \rho_{+1} \nabla^2_{\dot{x}_1,\dot{x}_1} \hat{\Omega}_{+1}(x, \dot{x}) & \cdots & \rho_{+1} \nabla^2_{\dot{x}_1,\dot{x}_n} \hat{\Omega}_{+1}(x, \dot{x}) \\ \vdots & \vdots & \ddots & \vdots \\ c_{+1,n}\ell'_{+1,n} I_m & \rho_{+1} \nabla^2_{\dot{x}_n,\dot{x}_1} \hat{\Omega}_{+1}(x, \dot{x}) & \cdots & \rho_{+1} \nabla^2_{\dot{x}_n,\dot{x}_n} \hat{\Omega}_{+1}(x, \dot{x}) \end{bmatrix}.$$

We now aim at decomposing the right-hand expression in order to verify the definiteness of the pseudo-Jacobian.
3.2.2 Decomposition of the Pseudo-Jacobian
To verify the positive definiteness of the pseudo-Jacobian, we further decompose the second summand of the above expression into a positive semi-definite and a strictly positive definite matrix. To this end, let us denote the smallest eigenvalues of the Hessians of the regularizers on the corresponding action spaces $W$ and $\phi(X)^n$ by

$$\lambda_{-1} := \inf_{w \in W} \lambda_{\min}\!\left( \nabla^2_{w,w} \hat{\Omega}_{-1}(w) \right), \qquad (11)$$

$$\lambda_{+1} := \inf_{\dot{x} \in \phi(X)^n} \lambda_{\min}\!\left( \nabla^2_{\dot{x},\dot{x}} \hat{\Omega}_{+1}(x, \dot{x}) \right), \qquad (12)$$

where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of the symmetric matrix $A$.
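For instance, for the regularizer $\hat{\Omega}_{-1}(w) = \frac{1}{2}\|w\|^2$ (an illustrative choice), the Hessian is the identity and $\lambda_{-1} = 1$ independently of $w$; for a general strongly convex quadratic regularizer $\frac{1}{2} w^T A w$, the bound is the smallest eigenvalue of $A$:

```python
import numpy as np

# Hessian of Omega(w) = 0.5 * ||w||^2 is the identity, so lambda_min = 1 for any w
H = np.eye(3)
lam = np.linalg.eigvalsh(H).min()       # smallest eigenvalue of a symmetric matrix
assert abs(lam - 1.0) < 1e-12

# a non-isotropic strongly convex quadratic regularizer 0.5 * w^T A w, A symmetric PD
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 3.0]])
lam_A = np.linalg.eigvalsh(A).min()     # bounds the Hessian below: A >= lam_A * I
assert lam_A > 0
```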
Remark 3  Note that the infimum in (11) and (12) is attained and is strictly positive: The mapping $\lambda_{\min} : M^{k \times k} \to \mathbb{R}$ is concave on the set of symmetric matrices $M^{k \times k}$ of dimension $k \times k$ (cf. Example 3.10 in Boyd and Vandenberghe, 2004), and in particular, it therefore follows that this mapping is continuous. Furthermore, the mappings $u_{-1} : W \to M^{m \times m}$ with $u_{-1}(w) := \nabla^2_{w,w} \hat{\Omega}_{-1}(w)$ and $u_{+1} : \phi(X)^n \to M^{m \cdot n \times m \cdot n}$ with $u_{+1}(\dot{x}) := \nabla^2_{\dot{x},\dot{x}} \hat{\Omega}_{+1}(x, \dot{x})$ are continuous (for any fixed $x$) by Assumption 2. Hence, the mappings $w \mapsto \lambda_{\min}(u_{-1}(w))$ and $\dot{x} \mapsto \lambda_{\min}(u_{+1}(\dot{x}))$ are also continuous, since each is precisely the composition $\lambda_{\min} \circ u_v$ of the continuous functions $\lambda_{\min}$ and $u_v$ for $v \in \{-1, +1\}$. Taking into account that a continuous mapping on a non-empty compact set attains its minimum, it follows that there exist elements $\bar{w} \in W$ and $\bar{\dot{x}} \in \phi(X)^n$ such that

$$\lambda_{-1} = \lambda_{\min}\!\left( \nabla^2_{w,w} \hat{\Omega}_{-1}(\bar{w}) \right), \qquad \lambda_{+1} = \lambda_{\min}\!\left( \nabla^2_{\dot{x},\dot{x}} \hat{\Omega}_{+1}(x, \bar{\dot{x}}) \right).$$

Moreover, since the Hessians of the regularizers are positive definite by Assumption 2, we see that $\lambda_v > 0$ holds for $v \in \{-1, +1\}$.
By the above definitions, we can decompose the regularizers' Hessians as follows,
\[
\nabla^2_{w,w}\hat\Omega_{-1}(w) = \lambda_{-1} I_m + \big(\nabla^2_{w,w}\hat\Omega_{-1}(w) - \lambda_{-1} I_m\big),
\]
\[
\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) = \lambda_{+1} I_{m\cdot n} + \big(\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) - \lambda_{+1} I_{m\cdot n}\big).
\]
As the regularizers are strictly convex and the $\lambda_v$ are positive, for each of the above equations the first summand is positive definite and the second summand is positive semi-definite.
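This eigenvalue split is easy to verify numerically. The following sketch (using NumPy and a randomly generated positive definite matrix as a hypothetical stand-in for a regularizer's Hessian; it is not one of the paper's regularizers) checks that the first summand is positive definite and the remainder is positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive definite "regularizer Hessian" H, for illustration only.
A = rng.normal(size=(5, 5))
H = A @ A.T + 0.5 * np.eye(5)           # symmetric positive definite

lam_min = np.linalg.eigvalsh(H).min()   # smallest eigenvalue, > 0 here

# Split H as in the text: a definite part lam_min * I plus a semi-definite rest.
definite_part = lam_min * np.eye(5)
semidefinite_rest = H - definite_part

print(lam_min > 0, np.linalg.eigvalsh(semidefinite_rest).min())
```

The smallest eigenvalue of the remainder is exactly zero (up to floating-point error), so the split is tight.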
Proposition 4  The pseudo-Jacobian has the representation
\[
J_r(w,\dot x) = J^{(1)}_r(w,\dot x) + J^{(2)}_r(w,\dot x) + J^{(3)}_r(w,\dot x) \qquad (13)
\]
where
\[
J^{(1)}_r(w,\dot x) = \Lambda_r
\begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T
\begin{bmatrix} \Gamma_{-1} & \Gamma_{-1} \\ \Gamma_{+1} & \Gamma_{+1} \end{bmatrix}
\begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix},
\]
\[
J^{(2)}_r(w,\dot x) = \Lambda_r
\begin{bmatrix}
\rho_{-1}\lambda_{-1} I_m & c_{-1,1}\ell'_{-1,1} I_m & \cdots & c_{-1,n}\ell'_{-1,n} I_m \\
c_{+1,1}\ell'_{+1,1} I_m & \rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
c_{+1,n}\ell'_{+1,n} I_m & 0 & \cdots & \rho_{+1}\lambda_{+1} I_m
\end{bmatrix},
\]
\[
J^{(3)}_r(w,\dot x) = \Lambda_r
\begin{bmatrix}
\rho_{-1}\nabla^2_{w,w}\hat\Omega_{-1}(w) - \rho_{-1}\lambda_{-1} I_m & 0 \\
0 & \rho_{+1}\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) - \rho_{+1}\lambda_{+1} I_{m\cdot n}
\end{bmatrix}.
\]
The above proposition restates the pseudo-Jacobian as a sum of the three matrices $J^{(1)}_r(w,\dot x)$, $J^{(2)}_r(w,\dot x)$, and $J^{(3)}_r(w,\dot x)$. Matrix $J^{(1)}_r(w,\dot x)$ contains all $\ell''_{v,i}$ terms, $J^{(2)}_r(w,\dot x)$ is a composition of scaled identity matrices, and $J^{(3)}_r(w,\dot x)$ contains the Hessians of the regularizers where the diagonal entries are reduced by $\rho_{-1}\lambda_{-1}$ and $\rho_{+1}\lambda_{+1}$, respectively. We further analyze these matrices in the following section.
3.2.3 Definiteness of the Summands of the Pseudo-Jacobian
Recall that we want to investigate whether the pseudo-Jacobian $J_r(w,\dot x)$ is positive definite for each pair of actions $(w,\dot x) \in W\times\phi(X)^n$. A sufficient condition is that $J^{(1)}_r(w,\dot x)$, $J^{(2)}_r(w,\dot x)$, and $J^{(3)}_r(w,\dot x)$ are positive semi-definite and at least one of these matrices is positive definite. From the definition of $\lambda_v$, it becomes apparent that $J^{(3)}_r$ is positive semi-definite. In addition, $J^{(2)}_r(w,\dot x)$ obviously becomes positive definite for sufficiently large $\rho_v$ as, in this case, the main diagonal dominates the non-diagonal entries. Finally, $J^{(1)}_r(w,\dot x)$ becomes positive semi-definite under some mild conditions on the loss functions.

In the following we derive these conditions, state lower bounds on the regularization parameters $\rho_v$, and provide formal proofs of the above claims. Therefore, we make the following assumptions on the loss functions $\ell_v$ and the regularizers $\hat\Omega_v$ for $v \in \{-1,+1\}$. Instances of these functions satisfying Assumptions 2 and 3 will be given in Section 5. A discussion on the practical implications of these assumptions is given in the subsequent section.
Assumption 3  For all $w \in W$ and $\dot x \in \phi(X)^n$ with $\dot x = [\dot x_1^T,\dots,\dot x_n^T]^T$ the following conditions are satisfied:

1. the second derivatives of the loss functions are equal for all $y \in Y$ and $i = 1,\dots,n$,
\[
\ell''_{-1}(f_w(\dot x_i), y) = \ell''_{+1}(f_w(\dot x_i), y),
\]
2. the players' regularization parameters satisfy
\[
\rho_{-1}\rho_{+1} > \frac{\tau^2}{\lambda_{-1}\lambda_{+1}}\, c_{-1}^T c_{+1},
\]
where $\lambda_{-1}, \lambda_{+1}$ are the smallest eigenvalues of the Hessians of the regularizers specified in (11) and (12), $c_v = [c_{v,1}, c_{v,2}, \dots, c_{v,n}]^T$, and
\[
\tau = \sup_{(x,y) \in \phi(X)\times Y} \frac{1}{2}\big| \ell'_{-1}(f_w(x), y) + \ell'_{+1}(f_w(x), y) \big|, \qquad (14)
\]
3. for all $i = 1,\dots,n$ either both players have equal instance-specific cost factors, $c_{-1,i} = c_{+1,i}$, or the partial derivative $\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x)$ of the data generator's regularizer is independent of $\dot x_j$ for all $j \neq i$.
Notice that $\tau$ in Equation 14 can be chosen to be finite as the set $\phi(X)\times Y$ is assumed to be compact, and consequently, the values of both continuous mappings $\ell'_{-1}(f_w(x), y)$ and $\ell'_{+1}(f_w(x), y)$ are finite for all $(x,y) \in \phi(X)\times Y$.
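As a concrete illustration, consider the logistic loss for the learner and its "opposite" for the data generator (this specific pair is our choice for illustration, anticipating Section 5; it is not the only pair satisfying the assumptions). The derivatives sum to $y\tanh(yz/2)$, so here $\tau$ is bounded by $1/2$ even without appealing to compactness:

```python
import numpy as np

# Sketch: for l_-1(z, y) = log(1 + exp(-y z)) and the "opposite" loss
# l_+1(z, y) = log(1 + exp(y z)), the sum of the derivatives is bounded,
# so tau in Equation 14 is finite.
def dl_minus(z, y):   # derivative of log(1 + exp(-y z)) w.r.t. z
    return -y / (1.0 + np.exp(y * z))

def dl_plus(z, y):    # derivative of log(1 + exp(y z)) w.r.t. z
    return y / (1.0 + np.exp(-y * z))

z = np.linspace(-50, 50, 100001)
tau = 0.5 * max(np.abs(dl_minus(z, y) + dl_plus(z, y)).max() for y in (-1, +1))
print(tau)  # the sum equals y * tanh(y z / 2), so tau approaches 1/2
```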
Lemma 5  Let $(w,\dot x) \in W\times\phi(X)^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J^{(1)}_r(w,\dot x)$ is symmetric positive semi-definite (but not positive definite) for $\Lambda_r$ defined as in Equation 8.

Proof. The special structure of $\Lambda_r$, $\dot X$, and $W_i$ gives
\[
J^{(1)}_r(w,\dot x) =
\begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T
\begin{bmatrix} r_0\Gamma_{-1} & r_0\Gamma_{-1} \\ R\,\Gamma_{+1} & R\,\Gamma_{+1} \end{bmatrix}
\begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix},
\]
with $R := \operatorname{diag}(r_1,\dots,r_n)$. From the assumption $\ell''_{-1,i} = \ell''_{+1,i}$ and the definition $r_0 = 1$, $r_i = \frac{c_{-1,i}}{c_{+1,i}} > 0$ for all $i = 1,\dots,n$ it follows that $\Gamma_{-1} = R\,\Gamma_{+1}$, such that
\[
J^{(1)}_r(w,\dot x) =
\begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T
\begin{bmatrix} \Gamma_{-1} & \Gamma_{-1} \\ \Gamma_{-1} & \Gamma_{-1} \end{bmatrix}
\begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix},
\]
which is obviously a symmetric matrix. Furthermore, we show that $z^T J^{(1)}_r(w,\dot x)\, z \ge 0$ holds for all vectors $z \in \mathbb{R}^{m+m\cdot n}$. To this end, let $z$ be arbitrarily given, and partition this vector into $z = [z_0^T, z_1^T, \dots, z_n^T]^T$ with $z_i \in \mathbb{R}^m$ for all $i = 0,1,\dots,n$. Then a simple calculation shows that
\[
z^T J^{(1)}_r(w,\dot x)\, z = \sum_{i=1}^n c_{-1,i}\,\ell''_{-1,i}\,\big(z_0^T \dot x_i + z_i^T w\big)^2 \ge 0
\]
since $\ell''_{-1,i} \ge 0$ for all $i = 1,\dots,n$ in view of the assumed convexity of mapping $\ell_{-1}(z,y)$. Hence, $J^{(1)}_r(w,\dot x)$ is positive semi-definite. This matrix cannot be positive definite since we have $z^T J^{(1)}_r(w,\dot x)\, z = 0$ for the particular vector $z$ defined by $z_0 := -w$ and $z_i := \dot x_i$ for all $i = 1,\dots,n$.
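The closed-form expression for $z^T J^{(1)}_r z$ and the null vector from this proof can be checked numerically. The sketch below builds $J^{(1)}_r$ for a small random instance, with all weights $r_i = 1$ and made-up non-negative curvature values standing in for $c_{-1,i}\ell''_{-1,i}$ (purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3                                   # toy sizes, illustrative only

w = rng.normal(size=m)
Xdot = rng.normal(size=(n, m))                # rows are the transformed instances
gamma = rng.uniform(0.1, 1.0, size=n)         # stands in for c_{-1,i} * l''_{-1,i} >= 0

# Build A = [[Xdot, 0...0], [0, W_1 ... W_n]] and the middle block [[G, G], [G, G]].
W_blocks = [np.zeros((n, m)) for _ in range(n)]
for i in range(n):
    W_blocks[i][i, :] = w                     # i-th row set to w^T
A_top = np.hstack([Xdot] + [np.zeros((n, m))] * n)
A_bot = np.hstack([np.zeros((n, m))] + W_blocks)
A = np.vstack([A_top, A_bot])
G = np.diag(gamma)
M = np.block([[G, G], [G, G]])

J1 = A.T @ M @ A                              # J_r^(1) with r_i = 1 for simplicity

# Positive semi-definite, and z = [-w, xdot_1, ..., xdot_n] lies in its null space.
z = np.concatenate([-w] + [Xdot[i] for i in range(n)])
print(np.linalg.eigvalsh(J1).min(), z @ J1 @ z)
```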
Lemma 6  Let $(w,\dot x) \in W\times\phi(X)^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J^{(2)}_r(w,\dot x)$ is positive definite for $\Lambda_r$ defined as in Equation 8.

Proof. A sufficient and necessary condition for the (possibly asymmetric) matrix $J^{(2)}_r(w,\dot x)$ to be positive definite is that the Hermitian matrix
\[
H(w,\dot x) := J^{(2)}_r(w,\dot x) + J^{(2)}_r(w,\dot x)^T
\]
is positive definite, that is, all eigenvalues of $H(w,\dot x)$ are positive. Let $\Lambda_r^{\frac12}$ denote the square root of $\Lambda_r$, which is defined in such a way that the diagonal elements of $\Lambda_r^{\frac12}$ are the square roots of the corresponding diagonal elements of $\Lambda_r$. Furthermore, we denote by $\Lambda_r^{-\frac12}$ the inverse of $\Lambda_r^{\frac12}$. Then, by Sylvester's law of inertia, the matrix
\[
\bar H(w,\dot x) := \Lambda_r^{-\frac12} H(w,\dot x)\, \Lambda_r^{-\frac12}
\]
has the same number of positive, zero, and negative eigenvalues as the matrix $H(w,\dot x)$ itself.
Hence, $J^{(2)}_r(w,\dot x)$ is positive definite if, and only if, all eigenvalues of
\[
\bar H(w,\dot x) = \Lambda_r^{-\frac12}\big(J^{(2)}_r(w,\dot x) + J^{(2)}_r(w,\dot x)^T\big)\Lambda_r^{-\frac12}
\]
\[
= \Lambda_r^{-\frac12}\Lambda_r
\begin{bmatrix}
\rho_{-1}\lambda_{-1} I_m & c_{-1,1}\ell'_{-1,1} I_m & \cdots & c_{-1,n}\ell'_{-1,n} I_m \\
c_{+1,1}\ell'_{+1,1} I_m & \rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
c_{+1,n}\ell'_{+1,n} I_m & 0 & \cdots & \rho_{+1}\lambda_{+1} I_m
\end{bmatrix}
\Lambda_r^{-\frac12}
+ \Lambda_r^{-\frac12}
\begin{bmatrix}
\rho_{-1}\lambda_{-1} I_m & c_{+1,1}\ell'_{+1,1} I_m & \cdots & c_{+1,n}\ell'_{+1,n} I_m \\
c_{-1,1}\ell'_{-1,1} I_m & \rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
c_{-1,n}\ell'_{-1,n} I_m & 0 & \cdots & \rho_{+1}\lambda_{+1} I_m
\end{bmatrix}
\Lambda_r \Lambda_r^{-\frac12}
\]
\[
= \begin{bmatrix}
2\rho_{-1}\lambda_{-1} I_m & \tilde c_1 I_m & \cdots & \tilde c_n I_m \\
\tilde c_1 I_m & 2\rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\tilde c_n I_m & 0 & \cdots & 2\rho_{+1}\lambda_{+1} I_m
\end{bmatrix}
\]
are positive, where $\tilde c_i := \sqrt{c_{-1,i}\,c_{+1,i}}\,\big(\ell'_{-1,i} + \ell'_{+1,i}\big)$. Each eigenvalue $\lambda$ of this matrix satisfies
\[
\big(\bar H(w,\dot x) - \lambda I_{m+m\cdot n}\big)v = 0
\]
for the corresponding eigenvector $v = [v_0^T, v_1^T, \dots, v_n^T]^T$ with $v_i \in \mathbb{R}^m$ for $i = 0,1,\dots,n$. This eigenvalue equation can be rewritten block-wise as
\[
(2\rho_{-1}\lambda_{-1} - \lambda)\,v_0 + \sum_{i=1}^n \tilde c_i v_i = 0, \qquad (15)
\]
\[
(2\rho_{+1}\lambda_{+1} - \lambda)\,v_i + \tilde c_i v_0 = 0 \quad \forall i = 1,\dots,n. \qquad (16)
\]
To compute all possible eigenvalues, we consider two cases. First, assume that $v_0 = 0$. Then (15) and (16) reduce to
\[
\sum_{i=1}^n \tilde c_i v_i = 0 \quad\text{and}\quad (2\rho_{+1}\lambda_{+1} - \lambda)\,v_i = 0 \ \ \forall i = 1,\dots,n.
\]
Since $v_0 = 0$ and eigenvector $v \neq 0$, at least one $v_i$ is non-zero. This implies that $\lambda = 2\rho_{+1}\lambda_{+1}$ is an eigenvalue. Using the fact that the null space of the linear mapping $v \mapsto \sum_{i=1}^n \tilde c_i v_i$ has dimension $(n-1)\cdot m$ (we have $n\cdot m$ degrees of freedom counting all components of $v_1,\dots,v_n$ and $m$ equations in $\sum_{i=1}^n \tilde c_i v_i = 0$), it follows that $\lambda = 2\rho_{+1}\lambda_{+1}$ is an eigenvalue of multiplicity $(n-1)\cdot m$.

Now we consider the second case where $v_0 \neq 0$. We may further assume that $\lambda \neq 2\rho_{+1}\lambda_{+1}$ (since otherwise we get the same eigenvalue as before, just with a different multiplicity). We then get from (16) that
\[
v_i = -\frac{\tilde c_i}{2\rho_{+1}\lambda_{+1} - \lambda}\, v_0 \quad \forall i = 1,\dots,n, \qquad (17)
\]
and when substituting this expression into (15), we obtain
\[
\left( (2\rho_{-1}\lambda_{-1} - \lambda) - \sum_{i=1}^n \frac{\tilde c_i^2}{2\rho_{+1}\lambda_{+1} - \lambda} \right) v_0 = 0.
\]
Taking into account that $v_0 \neq 0$, this implies
\[
0 = 2\rho_{-1}\lambda_{-1} - \lambda - \frac{1}{2\rho_{+1}\lambda_{+1} - \lambda}\sum_{i=1}^n \tilde c_i^2
\]
and, therefore,
\[
0 = \lambda^2 - 2(\rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1})\lambda + 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1} - \sum_{i=1}^n \tilde c_i^2.
\]
The roots of this quadratic equation are
\[
\lambda = \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} \pm \sqrt{ (\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + \sum_{i=1}^n \tilde c_i^2 }, \qquad (18)
\]
and these are the remaining eigenvalues of $\bar H(w,\dot x)$, each of multiplicity $m$ since there are precisely $m$ linearly independent vectors $v_0 \neq 0$ whereas the other vectors $v_i$ ($i = 1,\dots,n$) are uniquely defined by (17) in this case. In particular, this implies that the dimension of all three eigenspaces together is $(n-1)m + m + m = (n+1)m$, hence other eigenvalues cannot exist. Since the eigenvalue $\lambda = 2\rho_{+1}\lambda_{+1}$ is positive by Remark 3, it remains to show that the roots in (18) are positive as well. By Assumption 3, we have
\[
\sum_{i=1}^n \tilde c_i^2 = \sum_{i=1}^n c_{-1,i}\,c_{+1,i}\,\big(\ell'_{-1,i} + \ell'_{+1,i}\big)^2 \le 4\tau^2 c_{-1}^T c_{+1} < 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1},
\]
where $c_v = [c_{v,1}, c_{v,2}, \dots, c_{v,n}]^T$. This inequality and Equation 18 give
\[
\lambda = \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} \pm \sqrt{ (\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + \sum_{i=1}^n \tilde c_i^2 }
> \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} - \sqrt{ (\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1} } = 0.
\]
As all eigenvalues of $\bar H(w,\dot x)$ are positive, matrix $H(w,\dot x)$ and, consequently, also the matrix $J^{(2)}_r(w,\dot x)$ are positive definite.
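The spectrum derived in this proof can be verified numerically for the arrowhead matrix $\bar H$. The sketch below uses $m = 1$ for readability; $a$ and $b$ play the roles of $2\rho_{-1}\lambda_{-1}$ and $2\rho_{+1}\lambda_{+1}$, and the entries of $c$ stand in for the $\tilde c_i$ (all values made up):

```python
import numpy as np

n = 5
a, b = 3.0, 2.0
c = np.array([0.3, -0.2, 0.5, 0.1, 0.4])

# Arrowhead matrix H_bar with m = 1.
H_bar = np.zeros((n + 1, n + 1))
H_bar[0, 0] = a
H_bar[0, 1:] = c
H_bar[1:, 0] = c
H_bar[1:, 1:] = b * np.eye(n)

eigs = np.sort(np.linalg.eigvalsh(H_bar))

# Predicted spectrum: b with multiplicity n - 1, plus the two roots in (18).
root = np.sqrt((a / 2 - b / 2) ** 2 + (c ** 2).sum())
predicted = np.sort(np.concatenate([[a / 2 + b / 2 - root, a / 2 + b / 2 + root],
                                    np.full(n - 1, b)]))
print(np.abs(eigs - predicted).max())  # ~0: the derived spectrum is exact
```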
Lemma 7  Let $(w,\dot x) \in W\times\phi(X)^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J^{(3)}_r(w,\dot x)$ is positive semi-definite for $\Lambda_r$ defined as in Equation 8.

Proof. By Assumption 3, either both players have equal instance-specific costs, or the partial gradient $\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x)$ of the data generator's regularizer is independent of $\dot x_j$ for all $j \neq i$ and $i = 1,\dots,n$. Let us consider the first case where $c_{-1,i} = c_{+1,i}$, and consequently $r_i = 1$, for all $i = 1,\dots,n$, such that
\[
J^{(3)}_r(w,\dot x) = \begin{bmatrix}
\rho_{-1}\nabla^2_{w,w}\hat\Omega_{-1}(w) - \rho_{-1}\lambda_{-1} I_m & 0 \\
0 & \rho_{+1}\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) - \rho_{+1}\lambda_{+1} I_{m\cdot n}
\end{bmatrix}.
\]
The eigenvalues of this block diagonal matrix are the eigenvalues of the matrix $\rho_{-1}\big(\nabla^2_{w,w}\hat\Omega_{-1}(w) - \lambda_{-1} I_m\big)$ together with those of $\rho_{+1}\big(\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) - \lambda_{+1} I_{m\cdot n}\big)$. From the definition of $\lambda_v$ in (11) and (12) it follows that these matrices are positive semi-definite for $v \in \{-1,+1\}$. Hence, $J^{(3)}_r(w,\dot x)$ is positive semi-definite as well.

Now, let us consider the second case where we assume that $\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x)$ is independent of $\dot x_j$ for all $j \neq i$. Hence, $\nabla^2_{\dot x_i,\dot x_j}\hat\Omega_{+1}(x,\dot x) = 0$ for all $j \neq i$ such that
\[
J^{(3)}_r(w,\dot x) = \begin{bmatrix}
\rho_{-1}\tilde\Omega_{-1} & 0 & \cdots & 0 \\
0 & \rho_{+1}\frac{c_{-1,1}}{c_{+1,1}}\tilde\Omega_{+1,1} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \rho_{+1}\frac{c_{-1,n}}{c_{+1,n}}\tilde\Omega_{+1,n}
\end{bmatrix},
\]
where $\tilde\Omega_{-1} := \nabla^2_{w,w}\hat\Omega_{-1}(w) - \lambda_{-1} I_m$ and $\tilde\Omega_{+1,i} := \nabla^2_{\dot x_i,\dot x_i}\hat\Omega_{+1}(x,\dot x) - \lambda_{+1} I_m$. The eigenvalues of this block diagonal matrix are again the union of the eigenvalues of the single blocks $\rho_{-1}\tilde\Omega_{-1}$ and $\rho_{+1}\frac{c_{-1,i}}{c_{+1,i}}\tilde\Omega_{+1,i}$ for $i = 1,\dots,n$. As in the first part of the proof, $\tilde\Omega_{-1}$ is positive semi-definite. The eigenvalues of $\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x)$ are the union of all eigenvalues of the $\nabla^2_{\dot x_i,\dot x_i}\hat\Omega_{+1}(x,\dot x)$. Hence, each of these eigenvalues is greater than or equal to $\lambda_{+1}$ and thus each block $\tilde\Omega_{+1,i}$ is positive semi-definite. The factors $\rho_{-1} > 0$ and $\rho_{+1}\frac{c_{-1,i}}{c_{+1,i}} > 0$ are multipliers that do not affect the definiteness of the blocks, and consequently, $J^{(3)}_r(w,\dot x)$ is positive semi-definite as well.
The previous results guarantee the existence and uniqueness of a Nash equilibrium under the stated assumptions.

Theorem 8  Let Assumptions 2 and 3 hold. Then the Nash prediction game in (3) has a unique equilibrium.

Proof. The existence of an equilibrium of the Nash prediction game in (3) follows from Lemma 1. Proposition 4 and Lemmas 5 to 7 imply that there is a positive diagonal matrix $\Lambda_r$ such that $J_r(w,\dot x)$ is positive definite for all $(w,\dot x) \in W\times\phi(X)^n$. Hence, the uniqueness follows from Lemma 2.
3.2.4 Practical Implications of Assumptions 2 and 3
Theorem 8 guarantees the uniqueness of the equilibrium only if the cost functions of learner and data generator relate in a certain way that is defined by Assumption 3. In addition, each of the cost functions has to satisfy Assumption 2. This section discusses the practical implications of these assumptions.

The conditions of Assumption 2 impose rather technical limitations on the cost functions. The requirement of convexity is quite ordinary in the machine learning context. In addition, the loss function has to be twice continuously differentiable, which restricts the family of eligible loss functions. However, this condition can still be met easily; for instance, by smoothed versions of the hinge loss. The second requirement of uniformly strongly convex and twice continuously differentiable regularizers is, again, only a weak restriction in practice. These requirements are met by standard regularizers; they occur, for instance, in the optimization criteria of SVMs and logistic regression. The requirement of non-empty, compact, and convex action spaces may be a restriction when dealing with binary or multinomial attributes. However, relaxing the action spaces of the data generator would typically result in a strategy that is more defensive than would be optimal but still less defensive than a worst-case strategy.
The first condition of Assumption 3 requires the cost functions of learner and data generator to have the same curvatures. This is a crucial restriction; if the cost functions differ arbitrarily, the Nash equilibrium may not be unique. The requirement of identical curvatures is met, for instance, if one player chooses a loss function $\ell(f_w(\dot x_i), y)$ which only depends on the term $y f_w(\dot x_i)$, such as the SVM's hinge loss or the logistic loss. In this case, the condition is met when the other player chooses the loss $\ell(-f_w(\dot x_i), y)$. This loss is in some sense the opposite of $\ell(f_w(\dot x_i), y)$ as it approaches zero when the other goes to infinity and vice versa. In this case, the cost functions may still be non-antagonistic because the players' cost functions may contain instance-specific cost factors $c_{v,i}$ that can be modeled independently for the players.
The second part of Assumption 3 couples the degree of regularization of the players. If the data generator produces instances at application time that differ greatly from the instances at training time, then the learner is required to regularize strongly for a unique equilibrium to exist. If the distributions at training and application time are more similar, the equilibrium is unique for smaller values of the learner's regularization parameters. This requirement is in line with the intuition that when the training instances are a poor approximation of the distribution at application time, then imposing only weak regularization on the loss function will result in a poor model.
The final requirement of Assumption 3 is, again, rather a technical limitation. It states that the interdependencies between the players' instance-specific costs must be either captured by the regularizers, leading to a full Hessian, or by cost factors. These cost factors of learner and data generator may differ arbitrarily if the gradient of the data generator's costs of transforming an instance $x_i$ into $\dot x_i$ is independent of all other instances $\dot x_j$ with $j \neq i$. This is met, for instance, by cost models that only depend on some measure of the distance between $x_i$ and $\dot x_i$.
4. Finding the Unique Nash Equilibrium
According to Theorem 8, a unique equilibrium of the Nash prediction game in (3) exists for suitable loss functions and regularizers. To find this equilibrium, we derive and study two distinct methods: The first is based on the Nikaido-Isoda function that is constructed such that a minimax solution of this function is an equilibrium of the Nash prediction game and vice versa. This problem is then solved by inexact linesearch. In the second approach, we reformulate the Nash prediction game into a variational inequality problem which is solved by a modified extragradient method.

The data generator's action of transforming the input distribution manifests in a concatenation $\dot x \in \phi(X)^n$ of transformed training instances mapped into the feature space, $\dot x_i := \phi(\dot x_i)$ for $i = 1,\dots,n$, and the learner's action is to choose weight vector $w \in W$ of classifier $h(x) = \operatorname{sign} f_w(x)$ with linear decision function $f_w(x) = w^T\phi(x)$.
4.1 An Inexact Linesearch Approach
To solve for a Nash equilibrium, we again consider the game from (4) with one learner and $n$ data generators. A solution of this game can be identified with the help of the weighted Nikaido-Isoda function (Equation 19). For any two combinations of actions $(w,\dot x) \in W\times\phi(X)^n$ and $(w',\dot x') \in W\times\phi(X)^n$ with $\dot x = [\dot x_1^T,\dots,\dot x_n^T]^T$ and $\dot x' = [\dot x_1'^T,\dots,\dot x_n'^T]^T$, this function is the weighted sum of relative cost savings that the $n+1$ players can enjoy by changing from strategy $w$ to $w'$ and $\dot x_i$ to $\dot x_i'$, respectively, while the other players continue to play according to $(w,\dot x)$; that is,
\[
\mathcal{J}_r(w,\dot x,w',\dot x') := r_0\big(\hat\theta_{-1}(w,\dot x) - \hat\theta_{-1}(w',\dot x)\big) + \sum_{i=1}^n r_i\big(\hat\theta_{+1}(w,\dot x) - \hat\theta_{+1}(w,\dot x_{(i)})\big), \qquad (19)
\]
where $\dot x_{(i)} := [\dot x_1^T,\dots,\dot x_i'^T,\dots,\dot x_n^T]^T$. Let us denote the weighted sum of greatest possible cost savings with respect to any given combination of actions $(w,\dot x) \in W\times\phi(X)^n$ by
\[
\bar{\mathcal{J}}_r(w,\dot x) := \max_{(w',\dot x') \in W\times\phi(X)^n} \mathcal{J}_r(w,\dot x,w',\dot x'), \qquad (20)
\]
where $\bar w(w,\dot x), \bar x(w,\dot x)$ denotes the corresponding pair of maximizers. Note that the maximum in (20) is attained for any $(w,\dot x)$, since $W\times\phi(X)^n$ is assumed to be compact and $\mathcal{J}_r(w,\dot x,w',\dot x')$ is continuous in $(w',\dot x')$.

By these definitions, a combination $(w^*,\dot x^*)$ is an equilibrium of the Nash prediction game if, and only if, $(w^*,\dot x^*)$ is a global minimum of mapping $\bar{\mathcal{J}}_r$ with $\bar{\mathcal{J}}_r(w^*,\dot x^*) = 0$ for any fixed weights $r_i > 0$ and $i = 0,\dots,n$; see Proposition 2.1(b) of von Heusinger and Kanzow (2009). Equivalently, a Nash equilibrium simultaneously satisfies both equations $\bar w(w^*,\dot x^*) = w^*$ and $\bar x(w^*,\dot x^*) = \dot x^*$.

The significance of this observation is that the equilibrium problem in (3) can be reformulated into a minimization problem of the continuous mapping $\bar{\mathcal{J}}_r(w,\dot x)$. To solve this minimization problem, we make use of Corollary 3.4 (von Heusinger and Kanzow, 2009). We set the weights $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\dots,n$ as in (8), which ensures the main condition of Corollary 3.4; that is, the positive definiteness of the Jacobian $J_r(w,\dot x)$ in (13) (cf. proof of Theorem 8). According to this corollary, the vectors
\[
d_{-1}(w,\dot x) := \bar w(w,\dot x) - w \qquad\text{and}\qquad d_{+1}(w,\dot x) := \bar x(w,\dot x) - \dot x
\]
form a descent direction $d(w,\dot x) := [d_{-1}(w,\dot x)^T, d_{+1}(w,\dot x)^T]^T$ of $\bar{\mathcal{J}}_r(w,\dot x)$ at any position $(w,\dot x) \in W\times\phi(X)^n$ (except for the Nash equilibrium where $d(w^*,\dot x^*) = 0$), and consequently, there exists $t \in [0,1]$ such that
\[
\bar{\mathcal{J}}_r\big(w + t\,d_{-1}(w,\dot x),\ \dot x + t\,d_{+1}(w,\dot x)\big) < \bar{\mathcal{J}}_r(w,\dot x).
\]
Since $(w,\dot x)$ and $(\bar w(w,\dot x), \bar x(w,\dot x))$ are feasible combinations of actions, the convexity of the action spaces ensures that $(w + t\,d_{-1}(w,\dot x),\ \dot x + t\,d_{+1}(w,\dot x))$ is a feasible combination for any $t \in [0,1]$ as well. The following algorithm exploits these properties.
Algorithm 1 ILS: Inexact Linesearch Solver for Nash Prediction Games
Require: Cost functions $\hat\theta_v$ as defined in (1) and (2), and action spaces $W$ and $\phi(X)^n$.
1: Select initial $w^{(0)} \in W$, set $\dot x^{(0)} := x$, set $k := 0$, and select $\sigma \in (0,1)$ and $\beta \in (0,1)$.
2: Set $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\dots,n$.
3: repeat
4:  Set $d^{(k)}_{-1} := \bar w^{(k)} - w^{(k)}$ where $\bar w^{(k)} := \arg\max_{w'\in W} \mathcal{J}_r\big(w^{(k)}, \dot x^{(k)}, w', \dot x^{(k)}\big)$.
5:  Set $d^{(k)}_{+1} := \bar x^{(k)} - \dot x^{(k)}$ where $\bar x^{(k)} := \arg\max_{\dot x'\in\phi(X)^n} \mathcal{J}_r\big(w^{(k)}, \dot x^{(k)}, w^{(k)}, \dot x'\big)$.
6:  Find maximal step size $t^{(k)} \in \{\beta^l \mid l \in \mathbb{N}\}$ with
\[
\bar{\mathcal{J}}_r\big(w^{(k)}, \dot x^{(k)}\big) - \bar{\mathcal{J}}_r\big(w^{(k)} + t^{(k)} d^{(k)}_{-1},\ \dot x^{(k)} + t^{(k)} d^{(k)}_{+1}\big) \ge \sigma\, t^{(k)} \Big( \big\|d^{(k)}_{-1}\big\|_2^2 + \big\|d^{(k)}_{+1}\big\|_2^2 \Big).
\]
7:  Set $w^{(k+1)} := w^{(k)} + t^{(k)} d^{(k)}_{-1}$.
8:  Set $\dot x^{(k+1)} := \dot x^{(k)} + t^{(k)} d^{(k)}_{+1}$.
9:  Set $k := k + 1$.
10: until $\|w^{(k)} - w^{(k-1)}\|_2^2 + \|\dot x^{(k)} - \dot x^{(k-1)}\|_2^2 \le \varepsilon$.

The convergence properties of Algorithm 1 are discussed by von Heusinger and Kanzow (2009), so we skip the details here.
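To make the structure of Algorithm 1 concrete, here is a minimal sketch on a hypothetical two-player game with scalar actions and quadratic costs. The costs, best responses, and all constants below are made up solely so that the inner argmax steps have closed forms; this is not one of the paper's prediction games:

```python
# Illustrative sketch of Algorithm 1 (ILS) on a toy two-player game.
def theta_learner(w, x):   return 0.5 * (w - x) ** 2 + 0.5 * w ** 2
def theta_generator(w, x): return 0.5 * (x + w) ** 2 + 0.5 * x ** 2
def best_w(x): return x / 2.0     # argmin_w theta_learner(w, x)
def best_x(w): return -w / 2.0    # argmin_x theta_generator(w, x)

def J_bar(w, x):  # greatest possible cost savings, analogue of Equation 20
    return (theta_learner(w, x) - theta_learner(best_w(x), x)) \
         + (theta_generator(w, x) - theta_generator(w, best_x(w)))

w, x = 1.0, -0.5
sigma, beta = 1e-4, 0.5
for _ in range(100):
    d_w, d_x = best_w(x) - w, best_x(w) - x          # descent direction (lines 4-5)
    t = 1.0                                          # Armijo-style linesearch (line 6)
    while J_bar(w, x) - J_bar(w + t * d_w, x + t * d_x) < sigma * t * (d_w**2 + d_x**2):
        t *= beta
    w, x = w + t * d_w, x + t * d_x                  # update (lines 7-8)
    if d_w**2 + d_x**2 <= 1e-16:                     # stopping criterion (line 10)
        break
print(w, x)  # converges to the unique equilibrium (0, 0)
```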
4.2 A Modified Extragradient Approach
In Algorithm 1, lines 4 and 5, as well as the linesearch in line 6, require solving a concave maximization problem within each iteration. As this may become computationally demanding, we derive a second approach based on extragradient descent. Therefore, instead of reformulating the equilibrium problem into a minimax problem, we directly address the first-order optimality conditions of each player's minimization problem in (4): Under Assumption 2, a combination of actions $(w^*,\dot x^*)$ with $\dot x^* = [\dot x_1^{*T},\dots,\dot x_n^{*T}]^T$ satisfies each player's first-order optimality conditions if, and only if, for all $(w,\dot x) \in W\times\phi(X)^n$ the following inequalities hold,
\[
\nabla_w \hat\theta_{-1}(w^*,\dot x^*)^T (w - w^*) \ge 0,
\]
\[
\nabla_{\dot x_i} \hat\theta_{+1}(w^*,\dot x^*)^T (\dot x_i - \dot x_i^*) \ge 0 \quad \forall i = 1,\dots,n.
\]
As the joint action space of all players $W\times\phi(X)^n$ is precisely the full Cartesian product of the learner's action set $W$ and the $n$ data generators' action sets $\phi(X)$, the (weighted) sum of those individual optimality conditions is also a sufficient and necessary optimality condition for the equilibrium problem. Hence, a Nash equilibrium $(w^*,\dot x^*) \in W\times\phi(X)^n$ is a solution of the variational inequality problem,
\[
g_r(w^*,\dot x^*)^T \left( \begin{bmatrix} w \\ \dot x \end{bmatrix} - \begin{bmatrix} w^* \\ \dot x^* \end{bmatrix} \right) \ge 0 \quad \forall (w,\dot x) \in W\times\phi(X)^n, \qquad (21)
\]
and vice versa (cf. Proposition 7.1 of Harker and Pang, 1990). The pseudo-gradient $g_r$ in (21) is defined as in (5) with fixed vector $r = [r_0, r_1, \dots, r_n]^T$ where $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\dots,n$ (cf. Equation 8). Under Assumption 3, this choice of $r$ ensures that the mapping $g_r(w,\dot x)$ is continuous and strictly monotone (cf. proof of Lemma 2 and Theorem 8). Hence, the variational inequality problem in (21) can be solved by modified extragradient descent (see, for instance, Chapter 7.2.3 of Geiger and Kanzow, 1999).

Before presenting Algorithm 2, which is an extragradient-based algorithm for the Nash prediction game, let us denote the $L_2$-projection of $a$ into the non-empty, compact, and convex set $A$ by
\[
\Pi_A(a) := \arg\min_{a' \in A} \|a - a'\|_2^2.
\]
Notice that if $A := \{a \in \mathbb{R}^m \mid \|a\|_2 \le \kappa\}$ is the closed $l_2$-ball of radius $\kappa > 0$ and $a \notin A$, this projection simply reduces to a rescaling of vector $a$ to length $\kappa$.

Based on this definition of $\Pi_A$, we can now state an iterative method (Algorithm 2) which, apart from back projection steps, does not require solving an optimization problem in each iteration. The proposed algorithm converges to a solution of the variational inequality problem in (21), that is, the unique equilibrium of the Nash prediction game, if Assumptions 2 and 3 hold (cf. Theorem 7.40 of Geiger and Kanzow, 1999).
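The special case of projecting onto a closed $l_2$-ball can be sketched in a few lines (NumPy used for illustration):

```python
import numpy as np

# Minimal sketch of the L2-projection used by Algorithm 2, for the special
# case where the action set is a closed l2-ball of radius kappa.
def project_l2_ball(a, kappa):
    norm = np.linalg.norm(a)
    return a if norm <= kappa else (kappa / norm) * a

a = np.array([3.0, 4.0])                 # norm 5, outside the ball of radius 2
p = project_l2_ball(a, 2.0)
print(p, np.linalg.norm(p))              # rescaled to length 2
```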
Algorithm 2 EDS: Extragradient Descent Solver for Nash Prediction Games
Require: Cost functions $\hat\theta_v$ as defined in (1) and (2), and action spaces $W$ and $\phi(X)^n$.
1: Select initial $w^{(0)} \in W$, set $\dot x^{(0)} := x$, set $k := 0$, and select $\sigma \in (0,1)$ and $\beta \in (0,1)$.
2: Set $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\dots,n$.
3: repeat
4:  Set $\begin{bmatrix} d^{(k)}_{-1} \\ d^{(k)}_{+1} \end{bmatrix} := \Pi_{W\times\phi(X)^n}\left( \begin{bmatrix} w^{(k)} \\ \dot x^{(k)} \end{bmatrix} - g_r\big(w^{(k)}, \dot x^{(k)}\big) \right) - \begin{bmatrix} w^{(k)} \\ \dot x^{(k)} \end{bmatrix}$.
5:  Find maximal step size $t^{(k)} \in \{\beta^l \mid l \in \mathbb{N}\}$ with
\[
- g_r\big(w^{(k)} + t^{(k)} d^{(k)}_{-1},\ \dot x^{(k)} + t^{(k)} d^{(k)}_{+1}\big)^T \begin{bmatrix} d^{(k)}_{-1} \\ d^{(k)}_{+1} \end{bmatrix} \ge \sigma \left\| \begin{bmatrix} d^{(k)}_{-1} \\ d^{(k)}_{+1} \end{bmatrix} \right\|_2^2.
\]
6:  Set $\begin{bmatrix} \bar w^{(k)} \\ \bar x^{(k)} \end{bmatrix} := \begin{bmatrix} w^{(k)} \\ \dot x^{(k)} \end{bmatrix} + t^{(k)} \begin{bmatrix} d^{(k)}_{-1} \\ d^{(k)}_{+1} \end{bmatrix}$.
7:  Set step size of extragradient $\gamma^{(k)} := - \dfrac{ t^{(k)}\, g_r\big(\bar w^{(k)}, \bar x^{(k)}\big)^T \begin{bmatrix} d^{(k)}_{-1} \\ d^{(k)}_{+1} \end{bmatrix} }{ \big\| g_r\big(\bar w^{(k)}, \bar x^{(k)}\big) \big\|_2^2 }$.
8:  Set $\begin{bmatrix} w^{(k+1)} \\ \dot x^{(k+1)} \end{bmatrix} := \Pi_{W\times\phi(X)^n}\left( \begin{bmatrix} w^{(k)} \\ \dot x^{(k)} \end{bmatrix} - \gamma^{(k)} g_r\big(\bar w^{(k)}, \bar x^{(k)}\big) \right)$.
9:  Set $k := k + 1$.
10: until $\|w^{(k)} - w^{(k-1)}\|_2^2 + \|\dot x^{(k)} - \dot x^{(k-1)}\|_2^2 \le \varepsilon$.
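The following sketch mirrors the structure of Algorithm 2 on a hypothetical two-dimensional problem with a linear, strictly monotone pseudo-gradient; the map, the starting point, and the ball radius are all made up (the real algorithm operates on $g_r$ over $W\times\phi(X)^n$):

```python
import numpy as np

# Toy strictly monotone pseudo-gradient g(z) = [[2, -1], [1, 2]] z; its unique
# "equilibrium" over the l2-ball of radius 10 is the origin.
def g(z): return np.array([2 * z[0] - z[1], z[0] + 2 * z[1]])

def project(z, kappa=10.0):
    n = np.linalg.norm(z)
    return z if n <= kappa else (kappa / n) * z

z = np.array([5.0, -3.0])
sigma, beta = 0.5, 0.5
for _ in range(200):
    d = project(z - g(z)) - z                       # projected direction (line 4)
    if d @ d <= 1e-20:
        break
    t = 1.0
    while -g(z + t * d) @ d < sigma * (d @ d):      # step-size rule (line 5)
        t *= beta
    z_bar = z + t * d                               # trial point (line 6)
    gamma = -t * (g(z_bar) @ d) / (g(z_bar) @ g(z_bar))   # line 7
    z = project(z - gamma * g(z_bar))               # extragradient update (line 8)
print(z)  # approaches the origin
```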
5. Instances of the Nash Prediction Game
In this section, we present two instances of the Nash prediction game and investigate under which conditions those games possess unique Nash equilibria. We start by specifying both players' loss function and regularizer. An obvious choice for the loss function of the learner $\ell_{-1}(z,y)$ is the zero-one loss defined by
\[
\ell_{0/1}(z,y) := \begin{cases} 1, & \text{if } yz < 0 \\ 0, & \text{if } yz \ge 0 \end{cases}.
\]
A possible choice for the data generator's loss is $\ell_{0/1}(z,-1)$, which penalizes positive decision values $z$ independently of the class label. The rationale behind this choice is that the data generator experiences costs when the learner blocks an event, that is, assigns an instance to the positive class. For instance, a legitimate email sender experiences costs when a legitimate email is erroneously blocked, just like an abusive sender, also amalgamated into the data generator, experiences costs when spam messages are blocked. However, the zero-one loss violates Assumption 2 as it is neither convex nor twice continuously differentiable. In the following sections, we therefore approximate the zero-one loss by the logistic loss and a newly derived trigonometric loss, which both satisfy Assumption 2.
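For illustration, the sketch below compares the zero-one loss with a rescaled logistic loss. Dividing by $\log 2$ is our own normalization choice, made here so that the smooth surrogate upper-bounds the zero-one loss; the trigonometric loss of the later sections is not reproduced:

```python
import numpy as np

def zero_one(z, y):  # the (non-convex, non-smooth) zero-one loss
    return np.where(y * z < 0, 1.0, 0.0)

def logistic(z, y):  # logistic loss, rescaled by 1/log(2) (our normalization)
    return np.log1p(np.exp(-y * z)) / np.log(2.0)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(zero_one(z, +1))   # [1. 1. 0. 0. 0.]
print(logistic(z, +1))   # smooth, convex, and >= the zero-one loss everywhere
```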
Recall that $\hat\Omega_{+1}(D,\dot D)$ is an estimate of the transformation costs that the data generator incurs when transforming the distribution that generates the instances $x_i$ at training time into the distribution that generates the instances $\dot x_i$ at application time. In our analysis, we approximate these costs by the average squared $l_2$-distance between $x_i$ and $\dot x_i$ in the feature space induced by mapping $\phi$, that is,
\[
\hat\Omega_{+1}(D,\dot D) := \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\,\|\phi(\dot x_i) - \phi(x_i)\|_2^2. \qquad (22)
\]
The learner's regularizer $\hat\Omega_{-1}(w)$ penalizes the complexity of the predictive model $h(x) = \operatorname{sign} f_w(x)$. We consider Tikhonov regularization, which, for linear decision functions $f_w$, reduces to the squared $l_2$-norm of $w$,
\[
\hat\Omega_{-1}(w) := \frac{1}{2}\|w\|_2^2. \qquad (23)
\]
Before presenting the Nash logistic regression (NLR) and the Nash support vector machine (NSVM), we turn to a discussion on the applicability of general kernel functions.
5.1 Applying Kernels
φ :pping
So f ar , w e assumed the kno wle dge of featur e ma
X → φ( X) such that w e ca n compute
( xi ) of the tr a ining instances
an e xplicit featur e repre sentaφtion
xi for all i = 1,..., n. Ho we v er ,
in some applic a tio ns, such a featur e ma pping is u nwie ld y or hard Inste
to ideandtify
, one
. is of te n
equipped with a k ernel function
k : X × X → R wh ich measures the simila r ity b e tw e en tw o in sta nce s.
G e nerally , k er nel function
k is assumed to be a positi v e-se mid efinite k erne l such tha t it can be state d
∃φ w e,
in terms of a scalar product in th e corr esponding repr odu c ing k er nel Hilbe r t spac
ithth at is,
k( x, x′ ) = φ( x) T φ( x′ ) .
T o apply the represe n ter theore m (see , e.g., Sch ¨olk opf et al., 2001) we assume tha t the tr a nsformed instances lie in th e span of the ma pped train ing instances, that is , w e restrict th e data gener ator ’ s a ctio n space such that the tr ansf o r me xdi insta
ar e mapp
nc ese˙ d into the same subspace of the
repr od uc ing k ernel H ilbert s p a ce as the unmodifie d trainin xgi .inBsta
y this
nces
assumptio n, the
2636
S TATICP REDICTIONG AMESFORA DVERSARIALL EARNINGP ROBLEMS
φ( x˙ i ) ∈ φ( X) for i = 1,..., n can be e xpre ssed
w e ight v ec w
tor∈ W and the transfor me d instances
∃α that
as lin ea r comb inations of the ma pped trainin g instances,
such that
i , Ξi j is,
n
w=
∑ αφi ( xi )
i= 1
n
and φ( x˙ j ) =
∑ Ξi φj ( xi )
∀ j = 1,..., n.
i= 1
Further, let us assume that the action spaces W and φ(X)^n can be adequately translated into dual action spaces A ⊂ R^n and Z ⊂ R^{n×n}, which is possible, for instance, if W and φ(X)^n are closed ℓ2-balls. Then, a kernelized variant of the Nash prediction game is obtained by inserting the above equations into the players' cost functions in (1) and (2) with regularizers in (22) and (23),

    θ̂_{-1}(α, Ξ) = ∑_{i=1}^n c_{-1,i} ℓ_{-1}(α^T K Ξ e_i, y_i) + ρ_{-1} (1/2) α^T K α,                                  (24)
    θ̂_{+1}(α, Ξ) = ∑_{i=1}^n c_{+1,i} ℓ_{+1}(α^T K Ξ e_i, y_i) + ρ_{+1} (1/(2n)) tr((Ξ − I_n)^T K (Ξ − I_n)),           (25)

where α ∈ A is the dual weight vector, e_i ∈ {0, 1}^n is the i-th unit vector, Ξ ∈ Z is the dual transformed data matrix, and K ∈ R^{n×n} is the kernel matrix with K_{ij} := k(x_i, x_j). In the dual Nash prediction game with cost functions (24) and (25), the learner chooses the dual weight vector α = [α_1, ..., α_n]^T and classifies a new instance x by h(x) = sign f_α(x) with f_α(x) = ∑_{i=1}^n α_i k(x_i, x). In contrast, the data generator chooses the dual transformed data matrix Ξ, which implicitly reflects the change of the training distribution. Their transformation costs are in proportion to the deviation of Ξ from the identity matrix I_n, where if Ξ equals I_n, the learner's task reduces to standard kernelized empirical risk minimization. The proposed Algorithms 1 and 2 can be readily applied when replacing w by α and ẋ_i by Ξ e_i for all i = 1, ..., n.
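The dual cost functions (24) and (25) can be sketched directly in NumPy. The logistic loss below stands in for the generic losses ℓ_{-1} and ℓ_{+1}, and the kernel matrix and all data are synthetic; this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def loss(z, y):
    # illustrative placeholder for the players' losses: logistic loss log(1 + e^{-yz})
    return np.logaddexp(0.0, -y * z)

def dual_costs(alpha, Xi, K, y, c_minus, c_plus, rho_minus, rho_plus):
    """Evaluate the dual cost functions (24) and (25).

    alpha : dual weight vector, shape (n,)
    Xi    : dual transformed data matrix, shape (n, n); Xi = I_n means "no transformation"
    K     : kernel matrix, shape (n, n)
    """
    n = len(alpha)
    z = alpha @ K @ Xi                      # z_i = alpha^T K Xi e_i
    theta_minus = np.sum(c_minus * loss(z, y)) + 0.5 * rho_minus * (alpha @ K @ alpha)
    D = Xi - np.eye(n)                      # deviation from the identity
    theta_plus = (np.sum(c_plus * loss(z, -np.ones(n)))
                  + rho_plus / (2.0 * n) * np.trace(D.T @ K @ D))
    return theta_minus, theta_plus

# synthetic setup (illustrative assumptions)
rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))
K = A @ A.T                                  # a positive semi-definite kernel matrix
y = rng.choice([-1.0, 1.0], size=n)
c = np.full(n, 1.0 / n)                      # uniform cost factors
alpha = rng.normal(size=n)

# with Xi = I_n the generator's transformation cost vanishes
t_minus, t_plus = dual_costs(alpha, np.eye(n), K, y, c, c, 1.0, 1.0)
```

With Ξ = I_n the trace term in (25) is zero, so the generator's cost reduces to the pure loss term, matching the remark that the learner's task then reduces to standard kernelized empirical risk minimization.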
An alternative approach to a kernelization of the Nash prediction game is to first construct an explicit feature representation with respect to the given kernel function k and the training instances, and then to train the Nash model by applying this feature mapping. Here, we again assume that the transformed instances φ(ẋ_i) as well as the weight vector w lie in the span of the explicitly mapped training instances φ(x_i). Let us consider the kernel PCA map (see, e.g., Schölkopf and Smola, 2002) defined by

    φ_PCA : x ↦ Λ^{1/2+} V^T [k(x_1, x), ..., k(x_n, x)]^T,                                  (26)

where V is the column matrix of eigenvectors of the kernel matrix K, Λ is the diagonal matrix of the corresponding eigenvalues such that K = V Λ V^T, and Λ^{1/2+} denotes the pseudo-inverse of the square root of Λ with Λ = Λ^{1/2} Λ^{1/2}.
Remark 9 Notice that for any positive semi-definite kernel function k : X × X → R and fixed training instances x_1, ..., x_n ∈ X, the PCA map is a uniquely defined real function φ_PCA : X → R^n such that k(x_i, x_j) = φ_PCA(x_i)^T φ_PCA(x_j) for any i, j ∈ {1, ..., n}: We first show that φ_PCA is a real mapping from the input space X to the Euclidean space R^n. As x ↦ [k(x_1, x), ..., k(x_n, x)]^T is a real vector-valued function and V is a real n × n matrix, it remains to show that the pseudo-inverse of Λ^{1/2} is real as well. Since the kernel function is positive semi-definite, all eigenvalues λ_i of K are non-negative, and hence, Λ^{1/2} is a diagonal matrix with real diagonal entries √λ_i for i = 1, ..., n. The pseudo-inverse of this matrix is the uniquely defined diagonal matrix Λ^{1/2+} with real non-negative diagonal entries 1/√λ_i if λ_i > 0 and zero otherwise. This proves the first claim. The PCA map also satisfies k(x_i, x_j) = φ_PCA(x_i)^T φ_PCA(x_j) for any pair of training instances x_i and x_j as

    φ_PCA(x_i) = Λ^{1/2+} V^T [k(x_1, x_i), ..., k(x_n, x_i)]^T
               = Λ^{1/2+} V^T K e_i
               = Λ^{1/2+} V^T V Λ V^T e_i
               = Λ^{1/2+} Λ V^T e_i

for all i = 1, ..., n and consequently

    φ_PCA(x_i)^T φ_PCA(x_j) = e_i^T V Λ Λ^{1/2+} Λ^{1/2+} Λ V^T e_j
                            = e_i^T V Λ Λ^+ Λ V^T e_j
                            = e_i^T V Λ V^T e_j
                            = e_i^T K e_j = K_{ij} = k(x_i, x_j),

which proves the second claim.
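The PCA map (26) and the reproduction property of Remark 9 can be verified numerically in a few lines. The RBF kernel and the toy data below are our own illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def make_pca_map(K):
    """Build the kernel PCA map (26) from the kernel matrix K = V Lambda V^T."""
    lam, V = np.linalg.eigh(K)          # real eigendecomposition of the symmetric PSD matrix K
    sqrt_pinv = np.zeros_like(lam)      # diagonal entries of Lambda^{1/2+}
    pos = lam > 1e-12
    sqrt_pinv[pos] = 1.0 / np.sqrt(lam[pos])
    # phi_PCA(x) = Lambda^{1/2+} V^T [k(x_1, x), ..., k(x_n, x)]^T
    return lambda k_col: sqrt_pinv * (V.T @ k_col)

# toy data and an RBF kernel (illustrative assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
k = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
K = np.array([[k(a, b) for b in X] for a in X])

phi = make_pca_map(K)
Phi = np.stack([phi(K[:, i]) for i in range(len(X))])   # row i is phi_PCA(x_i)
# Remark 9: phi_PCA(x_i)^T phi_PCA(x_j) = k(x_i, x_j)
assert np.allclose(Phi @ Phi.T, K, atol=1e-8)
```

The eigenvalue clipping at 1e-12 realizes the "1/√λ_i if λ_i > 0 and zero otherwise" rule of the pseudo-inverse while guarding against round-off in the eigendecomposition.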
An equilibrium strategy pair w* ∈ W and [φ_PCA(ẋ*_1)^T, ..., φ_PCA(ẋ*_n)^T]^T ∈ φ(X)^n can be identified by applying the PCA map together with Algorithms 1 or 2. To classify a new instance x ∈ X, we may first map x into the PCA map-induced feature space and apply the linear classifier h(x) = sign f_{w*}(x) with f_{w*}(x) = w*^T φ_PCA(x). Alternatively, we can derive a dual representation of w* such that w* = ∑_{i=1}^n α*_i φ_PCA(x_i), and consequently f_{w*}(x) = f_{α*}(x) = ∑_{i=1}^n α*_i k(x_i, x), where α* = [α*_1, ..., α*_n]^T is a not necessarily uniquely defined dual weight vector of w*. Therefore, we have to identify a solution α* of the linear system

    w* = Λ^{1/2+} V^T K α*.                                  (27)
A direct calculation shows that

    α* := V Λ^{1/2+} w*                                  (28)

is a solution of (27) provided that either all elements λ_i of the diagonal matrix Λ are positive or that λ_i = 0 implies that the same component of the vector w* is also equal to zero (in which case the solution is non-unique). In fact, inserting (28) in (27) then gives

    Λ^{1/2+} V^T K α* = Λ^{1/2+} V^T V Λ V^T V Λ^{1/2+} w* = Λ^{1/2+} Λ^{1/2} Λ^{1/2} Λ^{1/2+} w* = w*,

whereas in the other cases the linear system (27) is obviously inconsistent. The advantage of the latter approach is that classifying a new instance x ∈ X requires the computation of the scalar product ∑_{i=1}^n α*_i k(x_i, x) rather than a matrix multiplication when mapping x into the PCA map-induced feature space (cf. Equation 26).
When implementing a kernelized solution, the data generator has to generate instances in the input space with dual representations K Ξ* e_1, ..., K Ξ* e_n and φ_PCA(ẋ*_1), ..., φ_PCA(ẋ*_n), respectively. To this end, the data generator must solve a pre-image problem which typically has a non-unique solution. However, as every solution of this problem incurs the same costs to both players, the data generator is free to select any of them. To find such a solution, the data generator may solve a non-convex optimization problem as proposed by Mika et al. (1999), or may apply a non-iterative method (Kwok and Tsang, 2003) based on multidimensional scaling.
5.2 Nash Logistic Regression
In this section we study the particular instance of the Nash prediction game where each player's loss function rests on the negative logarithm of the logistic function σ(a) := 1/(1 + e^{−a}), that is, the logistic loss

    ℓ^l(z, y) := −log σ(yz) = log(1 + e^{−yz}).                                  (29)

We consider the regularizers in (22) and (23), respectively, which give rise to the following definition of the Nash logistic regression (NLR). In the following definition, column vectors x := [x_1^T, ..., x_n^T]^T and ẋ := [ẋ_1^T, ..., ẋ_n^T]^T again denote the concatenation of the original and the transformed training instances, respectively, which are mapped into the feature space by x_i := φ(x_i) and ẋ_i := φ(ẋ_i).
Definition 10 The Nash logistic regression (NLR) is an instance of the Nash prediction game with non-empty, compact, and convex action spaces W ⊂ R^m and φ(X)^n ⊂ R^{m·n} and cost functions

    θ̂^l_{-1}(w, ẋ) := ∑_{i=1}^n c_{-1,i} ℓ^l(w^T ẋ_i, y_i) + ρ_{-1} (1/2) ||w||_2^2
    θ̂^l_{+1}(w, ẋ) := ∑_{i=1}^n c_{+1,i} ℓ^l(w^T ẋ_i, −1) + ρ_{+1} (1/n) ∑_{i=1}^n (1/2) ||ẋ_i − x_i||_2^2

where ℓ^l is specified in (29).
As in our introductory discussion, the data generator's loss function ℓ_{+1}(z, y) := ℓ^l(z, −1) penalizes positive decision values independently of the class label y. In contrast, instances that pass the classifier, that is, instances with negative decision values, incur little or almost no costs. By the above definition, the Nash logistic regression obviously satisfies Assumption 2, and according to the following corollary, also satisfies Assumption 3 for suitable regularization parameters.
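The two cost functions of Definition 10 translate directly into code. The synthetic data, the uniform cost factors c_{v,i} = 1/n, and the parameter values below are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

def logistic_loss(z, y):
    # Equation (29): l(z, y) = log(1 + exp(-y z)), evaluated stably
    return np.logaddexp(0.0, -y * z)

def theta_learner(w, X_dot, y, c, rho):
    # learner's cost: sum_i c_i * l(w^T xdot_i, y_i) + (rho/2) ||w||^2
    z = X_dot @ w
    return np.sum(c * logistic_loss(z, y)) + 0.5 * rho * (w @ w)

def theta_generator(w, X, X_dot, y, c, rho):
    # generator's cost: sum_i c_i * l(w^T xdot_i, -1)
    #                   + rho * (1/n) * sum_i (1/2) ||xdot_i - x_i||^2
    n = len(X)
    z = X_dot @ w
    transform_cost = np.sum((X_dot - X) ** 2) / (2.0 * n)
    return np.sum(c * logistic_loss(z, -np.ones(n))) + rho * transform_cost

# synthetic instances in an m-dimensional feature space (illustrative)
rng = np.random.default_rng(0)
n, m = 8, 4
X = rng.normal(size=(n, m))                    # original (mapped) instances
X_dot = X + 0.1 * rng.normal(size=(n, m))      # slightly transformed instances
y = rng.choice([-1.0, 1.0], size=n)
c = np.full(n, 1.0 / n)                        # unweighted case
w = rng.normal(size=m)

cost_learner = theta_learner(w, X_dot, y, c, rho=1.0)
cost_generator = theta_generator(w, X, X_dot, y, c, rho=8.0)
```

Note that the generator's loss term uses the fixed label −1 for every instance, mirroring the observation that positive decision values are penalized independently of the class label.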
Corollary 11 Let the Nash logistic regression be specified as in Definition 10 with positive regularization parameters ρ_{-1} and ρ_{+1} which satisfy

    ρ_{-1} ρ_{+1} ≥ n c_{-1}^T c_{+1};                                  (30)

then Assumptions 2 and 3 hold, and consequently, the Nash logistic regression possesses a unique Nash equilibrium.
Proof. By Definition 10, both players employ the logistic loss, ℓ_{-1}(z, y) := ℓ^l(z, y) and ℓ_{+1}(z, y) := ℓ^l(z, −1), and the regularizers in (22) and (23), respectively. Let

    ℓ′_{-1}(z, y) = −y/(1 + e^{yz})              ℓ′_{+1}(z, y) = 1/(1 + e^{−z})
    ℓ″_{-1}(z, y) = 1/((1 + e^z)(1 + e^{−z}))    ℓ″_{+1}(z, y) = 1/((1 + e^z)(1 + e^{−z}))                                  (31)

denote the first and second derivatives of the players' loss functions with respect to z ∈ R. Further, let

    ∇_w Ω̂_{-1}(w) = w              ∇_ẋ Ω̂_{+1}(x, ẋ) = (1/n)(ẋ − x)
    ∇²_{w,w} Ω̂_{-1}(w) = I_m       ∇²_{ẋ,ẋ} Ω̂_{+1}(x, ẋ) = (1/n) I_{m·n}

denote the gradients and Hessians of the players' regularizers. Assumption 2 holds as:
1. The second derivatives of ℓ_{-1}(z, y) and ℓ_{+1}(z, y) are positive and continuous for all z ∈ R and y ∈ Y. Consequently, ℓ_v(z, y) is convex and twice continuously differentiable with respect to z for v ∈ {−1, +1} and fixed y.

2. The Hessians of the players' regularizers are fixed, positive definite matrices, and consequently both regularizers are twice continuously differentiable and uniformly strongly convex in w ∈ W and ẋ ∈ φ(X)^n (for any fixed x ∈ φ(X)^n), respectively.

3. By Definition 10, the players' action sets are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces.
Assumption 3 holds as for all z ∈ R and y ∈ Y:

1. The second derivatives of ℓ_{-1}(z, y) and ℓ_{+1}(z, y) in (31) are equal.

2. The sum of the first derivatives of the loss functions is bounded,

    ℓ′_{-1}(z, y) + ℓ′_{+1}(z, y) = −y/(1 + e^{yz}) + 1/(1 + e^{−z})
                                  = { (1 − e^{−z})/(1 + e^{−z}), if y = +1; 2/(1 + e^{−z}), if y = −1 } ∈ (−1, 2),

which together with Equation 14 gives

    τ = sup_{(x,y) ∈ φ(X)×Y} (1/2) |ℓ′_{-1}(f_w(x), y) + ℓ′_{+1}(f_w(x), y)| < 1.

The supremum τ is strictly less than 1 since f_w(x) is finite for compact action sets W and φ(X)^n. The smallest eigenvalues of the players' regularizers are λ_{-1} = 1 and λ_{+1} = 1/n, such that the inequalities

    ρ_{-1} ρ_{+1} ≥ n c_{-1}^T c_{+1} > τ² (1/(λ_{-1} λ_{+1})) c_{-1}^T c_{+1}

hold.

3. The partial gradient ∇_{ẋ_i} Ω̂_{+1}(x, ẋ) = (1/n)(ẋ_i − x_i) of the data generator's regularizer is independent of ẋ_j for all j ≠ i and i = 1, ..., n.
As Assumptions 2 and 3 are satisfied, the existence of a unique Nash equilibrium follows immediately from Theorem 8.
Recall that the weighting factors c_{v,i} are strictly positive with ∑_{i=1}^n c_{v,i} = 1 for both players v ∈ {−1, +1}. In particular, it therefore follows that in the unweighted case where c_{v,i} = 1/n for all i = 1, ..., n and v ∈ {−1, +1}, a sufficient condition to ensure the existence of a unique Nash equilibrium is to set the learner's regularization parameter to ρ_{-1} ≥ 1/ρ_{+1}.
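Two ingredients of the proof of Corollary 11 are easy to check numerically: the sum of the players' first loss derivatives in (31) stays inside (−1, 2), and in the unweighted case n c_{-1}^T c_{+1} = 1, so that condition (30) reduces to ρ_{-1} ρ_{+1} ≥ 1. The grid of probe points below is an arbitrary illustrative choice.

```python
import numpy as np

def deriv_sum(z, y):
    # l'_{-1}(z, y) + l'_{+1}(z, y) = -y/(1 + e^{yz}) + 1/(1 + e^{-z}), cf. (31)
    return -y / (1.0 + np.exp(y * z)) + 1.0 / (1.0 + np.exp(-z))

# probe the sum of derivatives on a grid for both labels
z = np.linspace(-10, 10, 2001)
vals = np.concatenate([deriv_sum(z, 1.0), deriv_sum(z, -1.0)])
# the sum stays strictly inside the open interval (-1, 2)
assert vals.min() > -1.0 and vals.max() < 2.0

# unweighted case: c_{v,i} = 1/n gives n * c_{-1}^T c_{+1} = 1,
# so condition (30) reduces to rho_{-1} * rho_{+1} >= 1
n = 400
c = np.full(n, 1.0 / n)
assert np.isclose(n * (c @ c), 1.0)
```

Sampling on a finite grid cannot replace the analytic bound, but it makes the piecewise expressions in the proof easy to sanity-check.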
5.3 Nash Support Vector Machine
The Nash logistic regression tends to non-sparse solutions. This becomes particularly apparent if the Nash equilibrium (w*, ẋ*) is an interior point of the joint action set W × φ(X)^n, in which case the (partial) gradients in (9) and (10) are zero at (w*, ẋ*). For regularizer (23), this implies that w* is a linear combination of the transformed instances ẋ_i where all weighting factors are non-zero, since the first derivative of the logistic loss as well as the cost factors c_{-1,i} are non-zero for all i = 1, ..., n.
The support vector machine (SVM), which employs the hinge loss,

    ℓ^h(z, y) := max(0, 1 − yz) = { 1 − yz, if yz < 1; 0, if yz ≥ 1 },

does not suffer from non-sparsity; however, the hinge loss obviously violates Assumption 2 as it is not twice continuously differentiable. Therefore, we propose a twice continuously differentiable loss function that we call trigonometric loss, which satisfies Assumptions 2 and 3.
Definition 12 For any fixed smoothness factor s > 0, the trigonometric loss is defined by

    ℓ^t(z, y) := { −yz,                                 if yz < −s
                 { (s − yz)/2 − (s/π) cos(π yz/(2s)),   if |yz| ≤ s                                  (32)
                 { 0,                                   if yz > s.
The trigonometric loss is similar to the hinge loss in that it, except around the decision boundary, penalizes misclassifications in proportion to the decision value z ∈ R and attains zero for correctly classified instances. Analogous to the once continuously differentiable Huber loss, where a polynomial is embedded into the hinge loss, the trigonometric loss combines the perceptron loss ℓ^p(z, y) := max(0, −yz) with a trigonometric function. This trigonometric embedding yields a convex, twice continuously differentiable function.

Lemma 13 The trigonometric loss ℓ^t(z, y) is convex and twice continuously differentiable with respect to z ∈ R for any fixed y ∈ Y.
Proof. Let

    ℓ^{t′}(z, y) = { −y,                                  if yz < −s
                   { −(1/2) y + (1/2) y sin(π yz/(2s)),   if |yz| ≤ s
                   { 0,                                   if yz > s

    ℓ^{t″}(z, y) = { 0,                         if yz < −s
                   { (π/(4s)) cos(π yz/(2s)),   if |yz| ≤ s
                   { 0,                         if yz > s

denote the first and second derivatives of ℓ^t(z, y), respectively, with respect to z ∈ R. The trigonometric loss ℓ^t(z, y) is convex in z ∈ R (for any fixed y ∈ Y) as the second derivative is strictly positive if |z| = |yz| < s and zero otherwise. Moreover, since the second derivative is continuous,

    lim_{|z|→s^−} ℓ^{t″}(z, y) = (π/(4s)) cos(±π/2) = 0 = lim_{|z|→s^+} ℓ^{t″}(z, y),

the trigonometric loss is also twice continuously differentiable.
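The trigonometric loss (32) and its derivatives from Lemma 13 can be sketched and checked as follows. The smoothness factor s and the probe points are arbitrary illustrative choices.

```python
import numpy as np

def trig_loss(z, y, s=1.0):
    # Equation (32)
    t = y * z
    return np.where(t < -s, -t,
           np.where(t > s, 0.0,
                    0.5 * (s - t) - (s / np.pi) * np.cos(np.pi * t / (2 * s))))

def trig_loss_d1(z, y, s=1.0):
    # first derivative with respect to z (proof of Lemma 13)
    t = y * z
    return np.where(t < -s, -y,
           np.where(t > s, 0.0,
                    -0.5 * y + 0.5 * y * np.sin(np.pi * t / (2 * s))))

def trig_loss_d2(z, y, s=1.0):
    # second derivative with respect to z (proof of Lemma 13)
    t = y * z
    return np.where(np.abs(t) <= s, (np.pi / (4 * s)) * np.cos(np.pi * t / (2 * s)), 0.0)

s = 1.0
z = np.linspace(-3, 3, 601)
for y in (-1.0, 1.0):
    # outside the smoothed region the loss equals the perceptron loss max(0, -yz)
    out = np.abs(y * z) > s
    assert np.allclose(trig_loss(z, y, s)[out], np.maximum(0.0, -y * z)[out])
    # convexity: the second derivative is non-negative everywhere
    assert np.all(trig_loss_d2(z, y, s) >= 0.0)
    # smooth junctions at yz = s and yz = -s: derivative reaches 0 and -y exactly
    assert np.isclose(trig_loss_d1(s / y, y, s), 0.0, atol=1e-12)
    assert np.isclose(trig_loss_d1(-s / y, y, s), -y, atol=1e-12)
```

The assertions mirror the two claims of Lemma 13: non-negativity of the second derivative gives convexity, and the matching one-sided values at |yz| = s give twice continuous differentiability.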
Because of the similarities of the loss functions, we call the Nash prediction game that is based upon the trigonometric loss the Nash support vector machine (NSVM), where we again consider the regularizers in (22) and (23).
Definition 14 The Nash support vector machine (NSVM) is an instance of the Nash prediction game with non-empty, compact, and convex action spaces W ⊂ R^m and φ(X)^n ⊂ R^{m·n} and cost functions

    θ̂^t_{-1}(w, ẋ) := ∑_{i=1}^n c_{-1,i} ℓ^t(w^T ẋ_i, y_i) + ρ_{-1} (1/2) ||w||_2^2                                  (33)
    θ̂^t_{+1}(w, ẋ) := ∑_{i=1}^n c_{+1,i} ℓ^t(w^T ẋ_i, −1) + ρ_{+1} (1/n) ∑_{i=1}^n (1/2) ||ẋ_i − x_i||_2^2

where ℓ^t is specified in (32).
The following corollary states sufficient conditions under which the Nash support vector machine satisfies Assumptions 2 and 3, and consequently has a unique Nash equilibrium.

Corollary 15 Let the Nash support vector machine be specified as in Definition 14 with positive regularization parameters ρ_{-1} and ρ_{+1} which satisfy

    ρ_{-1} ρ_{+1} > n c_{-1}^T c_{+1};                                  (34)

then Assumptions 2 and 3 hold, and consequently, the Nash support vector machine has a unique Nash equilibrium.
Proof. By Definition 14, both players employ the trigonometric loss, ℓ_{-1}(z, y) := ℓ^t(z, y) and ℓ_{+1}(z, y) := ℓ^t(z, −1), and the regularizers in (22) and (23), respectively. Assumption 2 holds:

1. According to Lemma 13, ℓ^t(z, y), and consequently ℓ_{-1}(z, y) and ℓ_{+1}(z, y), are convex and twice continuously differentiable with respect to z ∈ R (for any fixed y ∈ {−1, +1}).

2. The regularizers of the Nash support vector machine are equal to those of the Nash logistic regression and possess the same properties as in Corollary 11.

3. By Definition 14, the players' action sets are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces.

Assumption 3 holds:
1. The second derivatives of ℓ_{-1}(z, y) and ℓ_{+1}(z, y) are equal for all z ∈ R since

    ℓ^{t″}(z, y) = { (π/(4s)) cos(π z/(2s)), if |z| ≤ s; 0, if |z| > s }

does not depend on y ∈ Y.

2. The sum of the first derivatives of the loss functions is bounded: for y = −1,

    ℓ′_{-1}(z, −1) + ℓ′_{+1}(z, −1) = 2 ℓ^{t′}(z, −1)
                                    = { 0, if z < −s; 1 + sin(π z/(2s)), if |z| ≤ s; 2, if z > s } ∈ [0, 2],

and for y = +1,

    ℓ′_{-1}(z, +1) + ℓ′_{+1}(z, +1) = { −1, if z < −s; sin(π z/(2s)), if |z| ≤ s; 1, if z > s } ∈ [−1, 1].

Together with Equation 14, it follows that

    τ = sup_{(x,y) ∈ φ(X)×Y} (1/2) |ℓ′_{-1}(f_w(x), y) + ℓ′_{+1}(f_w(x), y)| ≤ 1.

The smallest eigenvalues of the players' regularizers are λ_{-1} = 1 and λ_{+1} = 1/n, such that the inequalities

    ρ_{-1} ρ_{+1} > n c_{-1}^T c_{+1} ≥ τ² (1/(λ_{-1} λ_{+1})) c_{-1}^T c_{+1}

hold.
3. As for Nash logistic regression, the partial gradient ∇_{ẋ_i} Ω̂_{+1}(x, ẋ) = (1/n)(ẋ_i − x_i) of the data generator's regularizer is independent of ẋ_j for all j ≠ i and i = 1, ..., n.

Because Assumptions 2 and 3 are satisfied, the existence of a unique Nash equilibrium follows immediately from Theorem 8.
6. Experimental Evaluation

The goal of this section is to explore the relative strengths and weaknesses of the discussed instances of the Nash prediction game and existing reference methods in the context of email spam filtering. We compare a regular logistic regression (LR), the support vector machine (SVM), the SVM with trigonometric loss (SVMT, a variant of the SVM which minimizes (33) for the given training data), the worst-case solution SVM for invariances with feature removal (Invar-SVM, Globerson and Roweis, 2006; Teo et al., 2007), and the Nash equilibrium strategies Nash logistic regression (NLR) and Nash support vector machine (NSVM).
    data set       instances    features    delivery period
    ESP            169,612      541,713     01/06/2007 - 27/04/2010
    Mailinglist    128,117      266,378     01/04/1999 - 31/05/2006
    Private        108,178      582,100     01/08/2005 - 31/03/2010
    TREC 2007       75,496      214,839     04/08/2007 - 07/06/2007

Table 1: Data sets used in the experiments.
We use four corpora of chronologically sorted emails detailed in Table 1: The first data set (ESP) contains emails of an email service provider collected between 2007 and 2010. The second (Mailinglist) is a collection of emails from publicly available mailing lists augmented by spam emails from Bruce Guenter's spam trap of the same time period. The third corpus (Private) contains newsletters and spam and non-spam emails of the authors. The last corpus is the NIST TREC 2007 spam corpus.
The feature mapping φ(x) is defined such that email x ∈ X is tokenized with the X-tokenizer (Siefkes et al., 2004) and converted into the m-dimensional binary bag-of-words vector x ∈ {0, 1}^m. The value of m is determined by the number of distinct terms in the data set, where we have removed all terms which occur only once. For each experiment and each repetition, we then construct the PCA mapping (26) with respect to the corresponding n training emails using the linear kernel k(x, x′) := x^T x′, resulting in n-dimensional training instances φ_PCA(x_i) ∈ R^n for i = 1, ..., n. To ensure the convexity as well as the compactness requirement in Assumption 2, we notionally restrict the players' action sets by defining φ(X) := {φ_PCA(x) ∈ R^n | ||φ_PCA(x)||_2^2 ≤ κ} and W := {w ∈ R^n | ||w||_2^2 ≤ κ} for some fixed constant κ. Note that by choosing an arbitrarily large κ, the players' action sets become effectively unbounded.
For both algorithms, ILS and EDS, we set σ := 0.001, β := 0.2, and ε := 10^{−14}. The algorithms are stopped if l exceeds 30 in line 6 of ILS and line 5 of EDS, respectively; in this case, no convergence is achieved. In all experiments, we use the F-measure, that is, the harmonic mean of precision and recall, as evaluation measure and tune all parameters with respect to likelihood. The particular protocol and results of each experiment are detailed in the following sections.
6.1 Convergence
Corollaries 11 (for Nash logistic regression) and 15 (for the Nash support vector machine) specify conditions on the regularization parameters ρ_{-1} and ρ_{+1} under which a unique Nash equilibrium necessarily exists. When this is the case, both the ILS and EDS algorithms will converge on that Nash equilibrium. In the first set of experiments, we study whether repeated restarts of the algorithms converge on the same equilibrium when the bounds in Equations 30 and 34 are satisfied, and when they are violated to increasingly large degrees.
We set c_{v,i} := 1/n for v ∈ {−1, +1} and i = 1, ..., n, such that for ρ_{-1} > 1/ρ_{+1} both bounds (Equations 30 and 34) are satisfied. For each value of ρ_{-1} and ρ_{+1} and each of 10 repetitions, we randomly draw 400 emails from the data set and run EDS with randomly chosen initial solutions (w^{(0)}, ẋ^{(0)}) until convergence. We run ILS on the same training set; in each repetition, we randomly choose a distinct initial solution, and after each iteration k we compute the Euclidean distance between the EDS solution and the current ILS iterate w^{(k)}. Figure 1 reports on these average Euclidean distances between distinctly initialized runs. The blue curves (ρ_{-1} = 2/ρ_{+1}) satisfy Equations 30 and 34, the yellow curves (ρ_{-1} = 1/ρ_{+1}) lie exactly on the boundary; all other curves violate the bounds. Dotted lines show the Euclidean distance between the Nash equilibrium and the solution of logistic regression.
Our findings are as follows. Logistic regression and regular SVM never coincide with the Nash equilibrium; the Euclidean distances lie in the range between 10^{−2} and 10^{2}. ILS and EDS always converge to identical equilibria when (30) and (34) are satisfied (blue and yellow curves). The Euclidean distances lie at the threshold of numerical computing accuracy. When Equations 30 and 34 are violated by a factor up to 4 (turquoise and red curves), all repetitions still converge on the same equilibrium, indicating that the equilibrium is either still unique or a secondary equilibrium is unlikely to be found. When the bounds are violated by a factor of 8 or 16 (green and purple curves), then some repetitions of the learning algorithms do not converge or start to converge to
[Figure 1: four panels, one per value of ρ_{+1}, plotting the distance to the Nash equilibrium (log scale, 10^{−8} to 10^{0}) against ILS iterations 0 to 40; curves correspond to ρ_{-1} = 2^j/ρ_{+1} for j ∈ {1, 0, −1, −2, −4, −6}.]
Figure 1: Average Euclidean distance between the EDS solution and the ILS solution at iteration k = 0, ..., 40 for Nash logistic regression on the ESP corpus. The dotted lines show the distance between the EDS solution and the solution of logistic regression. Error bars indicate standard deviation.
distinct equilibria. In the latter case, learner and data generator may attain distinct equilibria and may experience an arbitrarily poor outcome when playing a Nash equilibrium.
6.2 Regularization Parameters
The regularization parameters ρ_v of the players v ∈ {−1, +1} play a major role in the prediction game. The learner's regularizer determines the generalization ability of the predictive model, and the data generator's regularizer controls the amount of change in the data generation process. In order to tune these parameters, one would need to have access to labeled data that are governed by the transformed input distribution. In our second experiment, we will explore to which extent those parameters can be estimated using a portion of the newest training data. Intuitively, the most recent training data may be more similar to the test data than older training data.
[Figure 2: Left: F-measure of NLR on hold-out and test data as a function of ρ_{-1} and ρ_{+1}. Right: sectional views showing F-measure on the hold-out (ho) and test (te) data for fixed ρ_{+1} ∈ {4, 8, 1024} (upper diagram) and fixed ρ_{-1} ∈ {0.0001, 0.002, 0.125} (lower diagram).]
Figure 2: Left: Performance of NLR on the hold-out and the test data with respect to regularization parameters. Right: Performance of NLR on the hold-out data (ho) and the test data (te) for fixed values of ρ_v.
We split the data set into three parts: The 2,000 oldest emails constitute the training portion, we use the next 2,000 emails as the hold-out portion on which the parameters are tuned, and the remaining emails are used as test set. We randomly draw 200 spam and 200 non-spam messages from the training portion and draw another subset of 400 emails from the hold-out portion. Both NPG instances are trained on the 400 training emails and evaluated against all emails of the test portion. To tune the parameters, we conduct a grid search maximizing the likelihood on the 400 hold-out emails. We repeat this experiment 10 times for all four data sets and report on the resulting parameters as well as the "optimal" reference parameters according to the maximal value of F-measure on the test set. Those optimal regularization parameters are not used in later experiments. The intuition of the experiment is that the data generation process has already been changed between the oldest and the latest emails. This change may cause a distribution shift which is reflected in the hold-out portion. We expect that one can tune each player's regularization parameter by tuning with respect to this hold-out set.
In Figure 2 (left) we plot the performance of the Nash logistic regression (NLR) on the hold-out and the test data against the regularization parameters ρ_{-1} and ρ_{+1}. The dashed line visualizes the bound in (30) on the regularization parameters for which NLR is guaranteed to possess a unique Nash equilibrium. Figure 2 (right) shows sectional views of the left plot along the ρ_{-1}-axis (upper diagram) and the ρ_{+1}-axis (lower diagram) for several values of ρ_{+1} and ρ_{-1}, respectively. As expected, the effect of the regularization parameters on the test data is much stronger than on the hold-out data.
It turns out that the data generator's ρ_{+1} has almost no impact on the value of F-measure on the hold-out data set (see lower right diagram of Figure 2). Hence, we conclude that estimating ρ_{+1} without access to labeled data from the test distribution or additional knowledge about the data generator is difficult for this application; the most recent training data are still too different from the test data. In all remaining experiments and for all data sets we set ρ_{+1} = 8 for NLR and ρ_{+1} = 2 for NSVM. For those choices the Nash models performed generally best on the hold-out set for a large variety of values of ρ_{-1}. For Invar-SVM the regularization of the data generator's transformation is controlled explicitly by the number K of modifiable attributes per positive instance. We conducted the same experiment for Invar-SVM, resulting in an optimal value of K = 25; that is, the data generator is allowed to remove up to 25 tokens of each spam email of the training data set. From the upper right diagram of Figure 2 we see that estimating ρ_{-1} for any fixed ρ_{+1} seems possible. Even if we slightly overestimate the learner's optimal regularization parameter (to compensate for the distributional difference between the transformed training sample and the marginally shifted hold-out set), the determined value of ρ_{-1} is close to the optimum for all four data sets.
6.3 Evaluation for Adversary Following an Equilibrium Strategy
We evaluate both a regular classifier trained under the i.i.d. assumption and a model that follows a Nash equilibrium strategy against both an adversary who does not transform the input distribution and an adversary who executes the Nash-equilibrial transformation on the input distribution. Since we cannot be certain that actual spam senders play a Nash equilibrium, we use the following semi-artificial setting.

The learner observes a sample of 200 spam and 200 non-spam emails drawn from the training portion of the data and estimates the Nash-optimal prediction model with parameters ẇ; the trivial baseline solution of regularized empirical risk minimization (ERM) is denoted by w. The data generator observes a distinct sample D of 200 spam and 200 non-spam messages, also drawn from the training portion, and computes their Nash-optimal response Ḋ.

We again set c_{v,i} := 1/n for v ∈ {−1, +1} and i = 1, ..., n and study the following four scenarios:
• (w, D): Both players ignore the presence of an opponent; that is, the learner employs a regular classifier and the sender does not change the data generation process.

• (w, Ḋ): The learner ignores the presence of an active data generator who changes the data generation process such that D evolves to Ḋ by playing a Nash strategy.

• (ẇ, D): The learner expects a rational data generator and chooses a Nash-equilibrial prediction model. However, the data generator does not change the input distribution.

• (ẇ, Ḋ): Both players are aware of the opponent and play a Nash-equilibrial action to secure lowest costs.
We repeat this experiment 100 times for all four data sets. Table 2 reports on the average values of F-measure over all repetitions and both NPG instances and corresponding baselines; numbers in boldface indicate significant differences (α = 0.05) between the F-measures of f_w and f_ẇ for fixed sample D and Ḋ, respectively.
As expected, when the data generator does not alter the input distribution, the regularized empirical risk minimization baselines, logistic regression and the SVM, are generally best. However,
    NLR vs. LR:
                     ESP            Mailinglist        Private          TREC 2007
                   D      Ḋ        D      Ḋ         D      Ḋ        D      Ḋ
        w        0.957  0.912    0.987  0.961     0.961  0.903    0.961  0.932
        ẇ        0.924  0.925    0.985  0.976     0.944  0.912    0.957  0.936

    NSVM vs. SVM:
                     ESP            Mailinglist        Private          TREC 2007
                   D      Ḋ        D      Ḋ         D      Ḋ        D      Ḋ
        w        0.955  0.928    0.987  0.958     0.980  0.955    0.979  0.960
        ẇ        0.939  0.939    0.984  0.976     0.979  0.961    0.981  0.968

Table 2: Nash predictor and regular classifier against passive and Nash-equilibrial data generator.
the performance of those baselines drops substantially when the data generator plays the Nash-equilibrial action Ḋ. The Nash-optimal prediction models are more robust against this transformation of the input distribution and significantly outperform the reference methods for all four data sets.
6.4 Case Study on Email Spam Filtering
To study the performance of the Nash prediction models and the baselines for email spam filtering, we evaluate all methods by processing the test set in chronological order. The test portion of each data set is split into 20 chronologically sorted disjoint subsets. We average the value of F-measure on each of those subsets over the 20 models (trained on different samples drawn from the training portion) for each method and perform a paired t-test. In the absence of information about player- and instance-specific costs, we again set c_{v,i} := 1/n for v ∈ {−1, +1}, i = 1, ..., n. Note that the chosen loss functions and regularizers would allow us to select any positive cost factors without violating Assumption 1.
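The evaluation protocol above (chronologically sorted test subsets, F-measure averaged over 20 models, paired t-test) can be sketched as follows. This is a minimal illustration, not code from the paper; the helper names `f_measure`, `chronological_subsets`, and `paired_t_statistic` are ours.

```python
import math

def f_measure(y_true, y_pred, positive=+1):
    """Harmonic mean of precision and recall for the positive (spam) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def chronological_subsets(examples, k=20):
    """Split a chronologically sorted test set into k disjoint, contiguous subsets."""
    n = len(examples)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [examples[bounds[i]:bounds[i + 1]] for i in range(k)]

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired F-measure differences between two methods,
    one score per model/subset; significance is then read off the t distribution."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```

Pairing the scores per subset, rather than pooling them, controls for the varying difficulty of the chronological subsets.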
Figure 3 shows that, for all data sets, the NPG instances outperform logistic regression (LR), the SVM, and the SVMT, which do not explicitly factor the adversary into the optimization criterion. Especially for the ESP corpus, the Nash logistic regression (NLR) and the Nash support vector machine (NSVM) are superior. On the TREC 2007 data set, the methods behave comparably, with a slight advantage for the Nash support vector machine. The period over which the TREC 2007 data have been collected is very short; we believe that the training and test instances are governed by nearly identical distributions. Consequently, for this data set, the game-theoretic models do not gain a significant advantage over logistic regression and the SVM, which assume i.i.d. samples. With respect to the non-game-theoretic baselines, the regular SVM outperforms LR and SVMT for most of the data sets.
Table 3 shows aggregated results over all four data sets. For each point in each of the diagrams in Figure 3, we conduct a pairwise comparison of all methods based on a paired t-test at a confidence level of α = 0.05. When a difference is significant, we count this as a win for the method that achieves the higher value of F-measure. Each line of Table 3 details the wins and, set in italics, the losses of one method against all other methods.
The Nash logistic regression and the Nash support vector machine have more wins than losses against each of the other methods. The ranking continues with Invar-SVM, the regular SVM, and logistic regression, down to the trigonometric loss SVM, which loses more frequently than it wins against all other methods.
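The win/loss tally behind Table 3 can be sketched as below. The function `tally_wins` and the `is_significant` callback (standing in for the paired t-test at α = 0.05) are hypothetical names of our own, not code from the paper.

```python
def tally_wins(method_scores, is_significant):
    """Count pairwise wins: method a scores a win over method b at evaluation
    point i if a's mean F-measure is higher and the paired difference is
    significant. Returns a dict mapping (winner, loser) to a win count."""
    methods = sorted(method_scores)
    wins = {(a, b): 0 for a in methods for b in methods if a != b}
    for a in methods:
        for b in methods:
            if a == b:
                continue
            for i, (sa, sb) in enumerate(zip(method_scores[a], method_scores[b])):
                if sa > sb and is_significant(a, b, i):
                    wins[(a, b)] += 1
    return wins
```

By construction the tally is antisymmetric: the wins of a over b in one cell reappear as the losses of b against a in the mirrored cell, which is the consistency visible in Table 3.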
STATIC PREDICTION GAMES FOR ADVERSARIAL LEARNING PROBLEMS
[Figure 3 comprises four panels — performance on the ESP, Mailinglist, Private, and TREC 2007 corpora — plotting F-measure over chronologically ordered test subsets for SVM, LR, TSVM, Invar-SVM, NLR (ILS), and NSVM (ILS).]

Figure 3: Value of F-measure of predictive models. Error bars indicate standard errors.
method vs. method   SVM    LR     SVMT   Invar-SVM  NLR    NSVM
SVM                 0:0    40:2   53:0   30:20      8:57   2:65
LR                  2:40   0:0    49:5   19:29      5:59   2:71
SVMT                0:53   5:49   0:0    9:47       2:70   2:74
Invar-SVM           20:30  29:19  47:9   0:0        5:57   3:57
NLR                 57:8   59:5   70:2   57:5       0:0    22:30
NSVM                65:2   71:2   74:2   57:3       30:22  0:0

Table 3: Results of paired t-test over all corpora: number of trials in which each method (row) has significantly outperformed each other method (column) vs. number of times it was outperformed.
6.5 Efficiency versus Effectiveness
To assess the predictive performance as well as the execution time as a function of the sample size, we train the baselines and the two NPG instances for a varying number of training examples.
[Figure 4 comprises two panels for the ESP corpus: predictive performance (F-measure) and execution time in seconds (log scale), each as a function of the number of training emails (50 to 3200), for SVM, LR, TSVM, Invar-SVM, NLR (ILS), NSVM (ILS), NLR (EDS), and NSVM (EDS).]

Figure 4: Predictive performance (left) and execution time (right) for varying sizes of the training data set.
We report on the results for the ESP data set in Figure 4. The game-theoretic models significantly outperform the trivial baseline methods logistic regression, the SVM, and the SVMT, especially for small data sets. However, this comes at the price of considerably higher computational cost. The ILS algorithm generally requires only a couple of iterations to converge; however, in each iteration several optimization problems have to be solved, so that the total execution time is up to a factor of 150 larger than that of the corresponding ERM baseline. In contrast to the ILS algorithm, a single iteration of the EDS algorithm does not require solving nested optimization problems. However, the execution time of the EDS algorithm is still higher, as it often requires several thousand iterations to fully converge. For larger data sets, the discrepancy in predictive performance between the game-theoretic models and the i.i.d. baselines decreases. Our results do not provide conclusive evidence whether ILS or EDS is faster at solving the optimization problems. We conclude that the benefit of the NPG prediction models over the classification baselines is greatest for small to medium sample sizes.
6.6 Nash-Equilibrial Transformation
In contrast to Invar-SVM, the Nash models allow the data generator to modify non-spam emails. In practice, however, most senders of legitimate messages do not deliberately change their writing behavior in order to bypass spam filters, perhaps with the exception of senders of newsletters, who must be careful not to trigger filtering mechanisms.
In a final experiment, we want to study whether the Nash model reflects this aspect of reality, and how the data generator's regularizer affects this transformation. The training portion again contains n_{+1} = 200 spam and n_{−1} = 200 non-spam instances randomly chosen from the oldest 4,000
emails. We determine the Nash equilibrium and measure the number of additions and deletions to spam and non-spam emails in D˙:

\Delta^{add}_{-1} := \frac{1}{n_{-1}} \sum_{i:y_i=-1} \sum_{j=1}^{m} \max(0,\, \dot{x}_{i,j} - x_{i,j}),
\Delta^{del}_{-1} := \frac{1}{n_{-1}} \sum_{i:y_i=-1} \sum_{j=1}^{m} \max(0,\, x_{i,j} - \dot{x}_{i,j}),
\Delta^{add}_{+1} := \frac{1}{n_{+1}} \sum_{i:y_i=+1} \sum_{j=1}^{m} \max(0,\, \dot{x}_{i,j} - x_{i,j}),
\Delta^{del}_{+1} := \frac{1}{n_{+1}} \sum_{i:y_i=+1} \sum_{j=1}^{m} \max(0,\, x_{i,j} - \dot{x}_{i,j}),

where x_{i,j} indicates the presence of token j in the i-th training email; that is, \Delta^{add}_v and \Delta^{del}_v denote the average number of word additions and deletions per spam and non-spam email performed by the sender.
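A direct transcription of these quantities, under the assumption that emails are represented as lists of token counts, might look as follows; the function name is ours, not the paper's.

```python
def average_modifications(X, X_dot, y, label):
    """Average number of word additions and deletions per email of class `label`,
    mirroring the definitions of Delta^add_v and Delta^del_v above. X and X_dot
    are lists of token-count vectors before and after the transformation; y holds
    the class labels (+1 spam, -1 non-spam)."""
    rows = [i for i, yi in enumerate(y) if yi == label]
    n = len(rows)
    # Additions: mass present in the transformed email but not in the original.
    add = sum(max(0.0, xd - x) for i in rows for x, xd in zip(X[i], X_dot[i])) / n
    # Deletions: mass present in the original but removed by the transformation.
    delete = sum(max(0.0, x - xd) for i in rows for x, xd in zip(X[i], X_dot[i])) / n
    return add, delete
```

Splitting the signed difference into its positive and negative parts in this way counts an email that both adds and drops words once in each tally, exactly as the two max(0, ·) terms do.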
Figure 5 shows the number of additions and deletions of the Nash transformation as a function of the adversary's regularization parameter for the ESP data set. Table 4 reports the average number of word additions and deletions for all data sets. For Invar-SVM, we set the number of possible deletions to K = 25.
ESP
game model   non-spam        spam
             add    del      add    del
Invar-SVM    0.0    0.0      0.0    24.8
NLR          0.7    1.0      22.5   31.2
NSVM         0.4    0.5      17.9   23.8

Mailinglist
game model   non-spam        spam
             add    del      add    del
Invar-SVM    0.0    0.0      0.0    23.9
NLR          0.3    0.4      8.6    10.9
NSVM         0.3    0.3      6.9    8.4

Private
game model   non-spam        spam
             add    del      add    del
Invar-SVM    0.0    0.0      0.0    24.2
NLR          0.4    0.2      24.3   11.2
NSVM         0.1    0.1      15.6   7.3

TREC 2007
game model   non-spam        spam
             add    del      add    del
Invar-SVM    0.0    0.0      0.0    24.7
NLR          0.2    0.2      15.0   11.4
NSVM         0.2    0.1      11.1   8.4

Table 4: Average number of word additions and deletions per training email.
The Nash-equilibrial transformation imposes almost no changes on any non-spam email; the number of modifications declines as the regularization parameter grows (see Figure 5). We observe for all data sets that even if the total amount of transformation differs for NLR and NSVM, both instances behave similarly insofar as the number of word additions and deletions continues to grow when the adversary's regularizer decreases.
7. Conclusion
We studied prediction games in which the learner and the data generator have conflicting but not necessarily directly antagonistic cost functions. We focused on static games in which the learner and the data generator have to commit simultaneously to a predictive model and a transformation of the input distribution, respectively. The cost-minimizing action of each player depends on the opponent's move; in the absence of information about the opponent's move, players may choose to play a Nash equilibrium strategy
[Figure 5 comprises two panels — amount of transformation for NLR (left) and for NSVM (right) — plotting the number of spam and non-spam word additions and deletions against the adversary's regularization parameter.]

Figure 5: Average number of additions and deletions per spam/non-spam email for NLR (left) and NSVM (right) with respect to the adversary's regularization parameter ρ_{+1} for fixed ρ_{−1} = n_{−1}.
which constitutes a cost-minimizing move for each player if the other player follows the equilibrium as well. Because a combination of actions from distinct equilibria may lead to arbitrarily high costs for either player, we have studied conditions under which a prediction game can be guaranteed to possess a unique Nash equilibrium. Lemma 1 identifies conditions under which at least one equilibrium exists, and Theorem 8 elaborates on when this equilibrium is unique. We propose an inexact line search approach and a modified extragradient approach for identifying this unique equilibrium. Empirically, both approaches perform quite similarly.
We derived Nash logistic regression and Nash support vector machine models and kernelized versions of these methods. Corollaries 11 and 15 specialize Theorem 8 and expound conditions on the players' regularization parameters under which the Nash logistic regression and the support vector machine converge on a unique Nash equilibrium. Empirically, we find that both methods identify unique Nash equilibria when the bounds laid out in Corollaries 11 and 15 are satisfied or violated by a factor of up to 4. From our experiments on several email corpora we conclude that Nash logistic regression and the Nash support vector machine outperform their i.i.d. baselines and Invar-SVM for the problem of classifying future emails based on training data from the past.
Acknowledgments

This work was supported by the German Science Foundation (DFG) under grant SCHE 540/12-1 and by STRATO AG. We thank Niels Landwehr and Christoph Sawade for constructive comments and suggestions, and the anonymous reviewer for helpful contributions and careful proofreading of the manuscript.