Journal of Machine Learning Research 13 (2012) 2617-2654    Submitted 8/11; Revised 5/12; Published 9/12

Static Prediction Games for Adversarial Learning Problems

Michael Brückner    MIBRUECK@CS.UNI-POTSDAM.DE
Department of Computer Science
University of Potsdam
August-Bebel-Str. 89
14482 Potsdam, Germany

Christian Kanzow    KANZOW@MATHEMATIK.UNI-WUERZBURG.DE
Institute of Mathematics
University of Würzburg
Emil-Fischer-Str. 30
97074 Würzburg, Germany

Tobias Scheffer    SCHEFFER@CS.UNI-POTSDAM.DE
Department of Computer Science
University of Potsdam
August-Bebel-Str. 89
14482 Potsdam, Germany

Editor: Nicolò Cesa-Bianchi

Abstract

The standard assumption of identically distributed training and test data is violated when the test data are generated in response to the presence of a predictive model. This becomes apparent, for example, in the context of email spam filtering. Here, email service providers employ spam filters, and spam senders engineer campaign templates to achieve a high rate of successful deliveries despite the filters. We model the interaction between the learner and the data generator as a static game in which the cost functions of the learner and the data generator are not necessarily antagonistic. We identify conditions under which this prediction game has a unique Nash equilibrium and derive algorithms that find the equilibrial prediction model. We derive two instances, the Nash logistic regression and the Nash support vector machine, and empirically explore their properties in a case study on email spam filtering.

Keywords: static prediction games, adversarial classification, Nash equilibrium

1. Introduction

A common assumption on which most learning algorithms are based is that training and test data are governed by identical distributions. However, in a variety of applications, the distribution that governs data at application time may be influenced by an adversary whose interests are in conflict with those of the learner. Consider, for instance, the following three scenarios. In computer and network security, scripts that control attacks are engineered with botnet and intrusion detection systems in mind. Credit card defrauders adapt their unauthorized use of credit cards (in particular, the amounts charged per transaction and per day and the type of businesses that amounts are charged from) to avoid triggering alerting mechanisms employed by credit card companies. Email spam senders design message templates that are instantiated by nodes of botnets. These templates are specifically designed to produce a low spam score with popular spam filters. The domain of email spam filtering will serve as a running example throughout the paper.
In all of these applications, the party that creates the predictive model and the adversarial party that generates future data are aware of each other, and factor the possible actions of their opponent into their decisions. The interaction between learner and data generator can be modeled as a game in which one player controls the predictive model whereas the other exercises some control over the process of data generation. The adversary's influence on the generation of the data can be formally modeled as a transformation that is imposed on the distribution that governs the data at training time. The transformed distribution then governs the data at application time. The optimization criterion of either player takes as arguments both the predictive model chosen by the learner and the transformation carried out by the adversary.

Typically, this problem is modeled under the worst-case assumption that the adversary desires to impose the highest possible costs on the learner. This amounts to a zero-sum game in which the loss of one player is the gain of the other. In this setting, both players can maximize their expected outcome by following a minimax strategy. Lanckriet et al. (2002) study the minimax probability machine (MPM). This classifier minimizes the maximal probability of misclassifying new instances for a given mean and covariance matrix of each class. Geometrically, these class means and covariances define two hyper-ellipsoids which are equally scaled such that they intersect; their common tangent is the minimax probabilistic decision hyperplane. Ghaoui et al. (2003) derive a minimax model for input data that are known to lie within some hyper-rectangles around the training instances. Their solution minimizes the worst-case loss over all possible choices of the data in these intervals. Similarly, worst-case solutions to classification games in which the adversary deletes input features (Globerson and Roweis, 2006; Globerson et al., 2009) or performs an arbitrary feature transformation (Teo et al., 2007; Dekel and Shamir, 2008; Dekel et al., 2010) have been studied.

Several applications motivate problem settings in which the goals of the learner and the data generator, while still conflicting, are not necessarily entirely antagonistic. For instance, a defrauder's goal of maximizing the profit made from exploiting phished account information is not the inverse of an email service provider's goal of achieving a high spam recognition rate at close-to-zero false positives. When playing a minimax strategy, one often makes overly pessimistic assumptions about the adversary's behavior and may not necessarily obtain an optimal outcome.

Games in which a leader (typically, the learner) commits to an action first whereas the adversary can react after the leader's action has been disclosed are naturally modeled as a Stackelberg competition. This model is appropriate when the follower (the data generator) has full information about the predictive model.
This assumption is usually a pessimistic approximation of reality because, for instance, neither email service providers nor credit card companies disclose a comprehensive documentation of their current security measures. Stackelberg equilibria of adversarial classification problems can be identified by solving a bilevel optimization problem (Brückner and Scheffer, 2011).

This paper studies static prediction games in which both players act simultaneously; that is, without prior information on their opponent's move. When the optimization criterion of both players depends not only on their own action but also on their opponent's move, then the concept of a player's optimal action is no longer well-defined. Therefore, we resort to the concept of a Nash equilibrium of static prediction games. A Nash equilibrium is a pair of actions chosen such that no player benefits from unilaterally selecting a different action. If a game has a unique Nash equilibrium and is played by rational players that aim at maximizing their optimization criteria, it is reasonable for each player to assume that the opponent will play according to the Nash equilibrium strategy. If one player plays according to the equilibrium strategy, the optimal move for the other player is to play this equilibrium strategy as well. If, however, multiple equilibria exist and the players choose their strategies according to distinct ones, then the resulting combination may be arbitrarily disadvantageous for either player. It is therefore interesting to study whether adversarial prediction games have a unique Nash equilibrium.

Our work builds on an approach that Brückner and Scheffer (2009) developed for finding a Nash equilibrium of a static prediction game. We will discuss a flaw in Theorem 1 of Brückner and Scheffer (2009) and develop a revised version of the theorem that identifies conditions under which a unique Nash equilibrium of a prediction game exists. In addition to the inexact linesearch approach to finding the equilibrium that Brückner and Scheffer (2009) develop, we will follow a modified extragradient approach and develop the Nash logistic regression and the Nash support vector machine. This paper also develops a kernelized version of these methods. An extended empirical evaluation explores the applicability of the Nash instances in the context of email spam filtering. We empirically verify the assumptions made in the modeling process and compare the performance of the Nash instances with baseline methods on several email corpora, including a corpus from an email service provider.

The rest of this paper is organized as follows. Section 2 introduces the problem setting. We formalize the Nash prediction game and study conditions under which a unique Nash equilibrium exists in Section 3. Section 4 develops strategies for identifying equilibrial prediction models, and in Section 5, we detail two instances of the Nash prediction game. In Section 6, we report on experiments on email spam filtering; Section 7 concludes.
2. Problem Setting

We study static prediction games between two players: the learner ($v = -1$) and an adversary, the data generator ($v = +1$). In our running example of email spam filtering, we study the competition between recipient and senders, not competition among senders. Therefore, $v = -1$ refers to the recipient whereas $v = +1$ models the entirety of all legitimate and abusive email senders as a single, amalgamated player.

At training time, the data generator $v = +1$ produces a sample $D = \{(x_i, y_i)\}_{i=1}^n$ of $n$ training instances $x_i \in \mathcal{X}$ with corresponding class labels $y_i \in \mathcal{Y} = \{-1, +1\}$. These object-class pairs are drawn according to a training distribution with density function $p(x, y)$. By contrast, at application time the data generator produces object-class pairs according to some test distribution with density $\dot p(x, y)$ which may differ from $p(x, y)$.

The task of the learner $v = -1$ is to select the parameters $w \in \mathcal{W} \subset \mathbb{R}^m$ of a predictive model $h(x) = \operatorname{sign} f_w(x)$ implemented in terms of a generalized linear decision function $f_w : \mathcal{X} \rightarrow \mathbb{R}$ with $f_w(x) = w^T \phi(x)$ and feature mapping $\phi : \mathcal{X} \rightarrow \mathbb{R}^m$. The learner's theoretical costs at application time are given by

$$\theta_{-1}(w, \dot p) = \sum_{y \in \mathcal{Y}} \int_{\mathcal{X}} c_{-1}(x, y)\, \ell_{-1}(f_w(x), y)\, \dot p(x, y)\, dx,$$

where weighting function $c_{-1} : \mathcal{X}\times\mathcal{Y} \rightarrow \mathbb{R}$ and loss function $\ell_{-1} : \mathbb{R}\times\mathcal{Y} \rightarrow \mathbb{R}$ compose the weighted loss $c_{-1}(x,y)\,\ell_{-1}(f_w(x), y)$ that the learner incurs when the predictive model classifies instance $x$ as $h(x) = \operatorname{sign} f_w(x)$ while the true label is $y$. The positive class- and instance-specific weighting factors $c_{-1}(x,y)$ with $\mathrm{E}_{X,Y}[c_{-1}(x,y)] = 1$ specify the importance of minimizing the loss $\ell_{-1}(f_w(x), y)$ for the corresponding object-class pair $(x, y)$. For instance, in spam filtering, the correct classification of non-spam messages can be business-critical for email service providers while failing to detect spam messages runs up processing and storage costs, depending on the size of the message.

The data generator $v = +1$ can modify the data generation process at application time. In practice, spam senders update their campaign templates which are disseminated to the nodes of botnets. Formally, the data generator transforms the training distribution with density $p$ into the test distribution with density $\dot p$. The data generator incurs transformation costs by modifying the data generation process, which is quantified by $\Omega_{+1}(p, \dot p)$. This term acts as a regularizer on the transformation and may implicitly constrain the possible difference between the distributions at training and application time, depending on the nature of the application that is to be modeled. For instance, the email sender may not be allowed to alter the training distribution for non-spam messages, or to modify the nature of the messages by changing the label from spam to non-spam or vice versa.
Additionally, changing the training distribution for spam messages may incur costs depending on the extent of distortion inflicted on the informational payload. The theoretical costs of the data generator at application time are the sum of the expected prediction costs and the transformation costs,

$$\theta_{+1}(w, \dot p) = \sum_{y \in \mathcal{Y}} \int_{\mathcal{X}} c_{+1}(x, y)\, \ell_{+1}(f_w(x), y)\, \dot p(x, y)\, dx + \Omega_{+1}(p, \dot p),$$

where, in analogy to the learner's costs, $c_{+1}(x,y)\,\ell_{+1}(f_w(x), y)$ quantifies the weighted loss that the data generator incurs when instance $x$ is labeled as $h(x) = \operatorname{sign} f_w(x)$ while the true label is $y$. The weighting factors $c_{+1}(x,y)$ with $\mathrm{E}_{X,Y}[c_{+1}(x,y)] = 1$ express the significance of $(x, y)$ from the perspective of the data generator. In our example scenario, this reflects that costs of correctly or incorrectly classified instances may vary greatly across the different physical senders that are aggregated into the amalgamated player.

Since the theoretical costs of both players depend on the test distribution, they can, for all practical purposes, not be calculated. Hence, we focus on a regularized, empirical counterpart of the theoretical costs based on the training sample $D$. The empirical counterpart $\hat\Omega_{+1}(D, \dot D)$ of the data generator's regularizer $\Omega_{+1}(p, \dot p)$ penalizes the divergence between training sample $D = \{(x_i, y_i)\}_{i=1}^n$ and a perturbated training sample $\dot D = \{(\dot x_i, y_i)\}_{i=1}^n$ that would be the outcome of applying the transformation that translates $p$ into $\dot p$ to sample $D$. The learner's cost function, instead of integrating over $\dot p$, sums over the elements of the perturbated training sample $\dot D$.

The players' empirical cost functions can still only be evaluated after the learner has committed to parameters $w$ and the data generator to a transformation. However, this transformation needs only be represented in terms of the effects that it will have on the training sample $D$. The transformed training sample $\dot D$ must not be mistaken for test data; test data are generated under $\dot p$ at application time after the players have committed to their actions.

The empirical costs incurred by the predictive model $h(x) = \operatorname{sign} f_w(x)$ with parameters $w$ and the shift from $p$ to $\dot p$ amount to

$$\hat\theta_{-1}(w, \dot D) = \sum_{i=1}^{n} c_{-1,i}\, \ell_{-1}(f_w(\dot x_i), y_i) + \rho_{-1} \hat\Omega_{-1}(w), \qquad (1)$$

$$\hat\theta_{+1}(w, \dot D) = \sum_{i=1}^{n} c_{+1,i}\, \ell_{+1}(f_w(\dot x_i), y_i) + \rho_{+1} \hat\Omega_{+1}(D, \dot D), \qquad (2)$$

where we have replaced the weighting terms $\frac{1}{n} c_v(\dot x_i, y_i)$ by constant cost factors $c_{v,i} > 0$ with $\sum_i c_{v,i} = 1$. The learner's regularizer $\hat\Omega_{-1}(w)$ in (1) accounts for the fact that $\dot D$ does not constitute the test data itself, but is merely a training sample transformed to reflect the test distribution and then used to learn the model parameters $w$. The trade-off between the empirical loss and the regularizer is controlled by each player's regularization parameter $\rho_v > 0$ for $v \in \{-1, +1\}$.

Note that either player's empirical costs $\hat\theta_v$ depend on both players' actions: $w \in \mathcal{W}$ and $\dot D \subseteq \mathcal{X}\times\mathcal{Y}$. Because of the potentially conflicting players' interests, the decision process for $w$ and $\dot D$ becomes a non-cooperative two-player game, which we call a prediction game.
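For illustration, the following minimal Python sketch evaluates the empirical costs (1) and (2) for one concrete choice of losses and regularizers. The logistic-type losses and the squared-norm and squared-distance regularizers used here anticipate the instances of Section 5 and are assumptions of the sketch, not part of the general formulation above.

import numpy as np

def logistic_loss(z, y):
    # learner's loss ell_{-1}(z, y) = log(1 + exp(-y z))
    return np.logaddexp(0.0, -y * z)

def generator_loss(z, y):
    # illustrative data generator's loss: log(1 + exp(z)), a smooth stand-in
    # that penalizes positive decision values regardless of the class label
    return np.logaddexp(0.0, z)

def empirical_costs(w, X_dot, X, y, c_neg, c_pos, rho_neg, rho_pos):
    """Empirical costs (1) and (2) for a linear model f_w(x) = w^T phi(x).

    X, X_dot : (n, m) arrays whose rows are phi(x_i) and phi(x_dot_i)
    c_neg, c_pos : instance-specific cost factors of learner and data generator
    """
    z = X_dot @ w                                            # f_w(x_dot_i)
    theta_neg = np.sum(c_neg * logistic_loss(z, y)) + rho_neg * 0.5 * float(w @ w)
    transform_cost = np.mean(0.5 * np.sum((X_dot - X) ** 2, axis=1))
    theta_pos = np.sum(c_pos * generator_loss(z, y)) + rho_pos * transform_cost
    return theta_neg, theta_pos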
In the following section, we will refer to the Nash prediction game (NPG), which identifies the concept of an optimal move of the learner and the data generator under the assumption of simultaneously acting players.

3. The Nash Prediction Game

The outcome of a prediction game is one particular combination of actions $(w^*, \dot D^*)$ that incurs costs $\hat\theta_v(w^*, \dot D^*)$ for the players. Each player is aware that this outcome is affected by both players' actions and that, consequently, their potential to choose an action can have an impact on the other player's decision. In general, there is no action that minimizes one player's cost function independent of the other player's action. In a non-cooperative game, the players are not allowed to communicate while making their decisions and therefore they have no information about the other player's strategy. In this setting, any concept of an optimal move requires additional assumptions on how the adversary will act.

We model the decision process for $w^*$ and $\dot D^*$ as a static two-player game with complete information. In a static game, both players commit to an action simultaneously, without information about their opponent's action. In a game with complete information, both players know their opponent's cost function and action space.

When $\hat\theta_{-1}$ and $\hat\theta_{+1}$ are known and antagonistic, the assumption that the adversary will seek the greatest advantage by inflicting the greatest damage on $\hat\theta_{-1}$ justifies the minimax strategy: $\arg\min_{w} \max_{\dot D} \hat\theta_{-1}(w, \dot D)$. However, when the players' cost functions are not antagonistic, assuming that the adversary will inflict the greatest possible damage is overly pessimistic. Instead, assuming that the adversary acts rationally in the sense of seeking the greatest possible personal advantage leads to the concept of a Nash equilibrium. An equilibrium strategy is a steady state of the game in which neither player has an incentive to unilaterally change their plan of actions. In static games, equilibrium strategies are called Nash equilibria, which is why we refer to the resulting predictive model as the Nash prediction game (NPG). In a two-player game, a Nash equilibrium is defined as a pair of actions such that no player can benefit from changing their action unilaterally; that is,

$$\hat\theta_{-1}(w^*, \dot D^*) = \min_{w \in \mathcal{W}} \hat\theta_{-1}(w, \dot D^*),$$
$$\hat\theta_{+1}(w^*, \dot D^*) = \min_{\dot D \subseteq \mathcal{X}\times\mathcal{Y}} \hat\theta_{+1}(w^*, \dot D),$$

where $\mathcal{W}$ and $\mathcal{X}\times\mathcal{Y}$ denote the players' action spaces. However, a static prediction game may not have a Nash equilibrium, or it may possess multiple equilibria. If $(w^*, \dot D^*)$ and $(w', \dot D')$ are distinct Nash equilibria and each player decides to act according to a different one of them, then combinations $(w^*, \dot D')$ and $(w', \dot D^*)$ may incur arbitrarily high costs for both players. Hence, one can argue that it is rational for an adversary to play a Nash equilibrium only when the following assumption is satisfied.

Assumption 1 The following statements hold:
1. both players act simultaneously;
2. both players have full knowledge about both (empirical) cost functions $\hat\theta_v(w, \dot D)$ defined in (1) and (2), and both action spaces $\mathcal{W}$ and $\mathcal{X}\times\mathcal{Y}$;
3. both players act rationally with respect to their cost function in the sense of securing their lowest possible costs;
4. a unique Nash equilibrium exists.

Whether Assumptions 1.1-1.3 are adequate (especially the assumption of simultaneous actions) strongly depends on the application. For example, in some applications, the data generator may unilaterally be able to acquire information about the model $f_w$ before committing to $\dot D$. Such situations are better modeled as a Stackelberg competition (Brückner and Scheffer, 2011). On the other hand, when the learner is able to treat any executed action as part of the training data $D$ and update the model $w$, the setting is better modeled as repeated executions of a static game with simultaneous actions. The adequateness of Assumption 1.4, which we discuss in the following sections, depends on the chosen loss functions, the cost factors, and the regularizers.

3.1 Existence of a Nash Equilibrium

Theorem 1 of Brückner and Scheffer (2009) identifies conditions under which a unique Nash equilibrium exists. Kanzow located a flaw in the proof of this theorem: The proof argues that the pseudo-Jacobian can be decomposed into two (strictly) positive stable matrices by showing that the real part of every eigenvalue of those two matrices is positive. However, this does not generally imply that the sum of these matrices is positive stable as well, since this would require a common Lyapunov solution (cf. Problem 2.2.6 of Horn and Johnson, 1991). But even if such a solution exists, the positive definiteness cannot be concluded from the positiveness of all eigenvalues as the pseudo-Jacobian is generally non-symmetric.

Having "unproven" prior claims, we will now derive sufficient conditions for the existence of a Nash equilibrium. To this end, we first define

$$x := \left[\phi(x_1)^T, \phi(x_2)^T, \dots, \phi(x_n)^T\right]^T \in \phi(\mathcal{X})^n \subset \mathbb{R}^{m\cdot n},$$
$$\dot x := \left[\phi(\dot x_1)^T, \phi(\dot x_2)^T, \dots, \phi(\dot x_n)^T\right]^T \in \phi(\mathcal{X})^n \subset \mathbb{R}^{m\cdot n},$$

as long, concatenated column vectors induced by feature mapping $\phi$, training sample $D = \{(x_i, y_i)\}_{i=1}^n$, and transformed training sample $\dot D = \{(\dot x_i, y_i)\}_{i=1}^n$, respectively. For terminological harmony, we refer to vector $\dot x$ as the data generator's action with corresponding action space $\phi(\mathcal{X})^n$. We make the following assumptions on the action spaces and the cost functions which enable us to state the main result on the existence of at least one Nash equilibrium in Lemma 1.

Assumption 2 The players' cost functions defined in Equations 1 and 2, and their action sets $\mathcal{W}$ and $\phi(\mathcal{X})^n$, satisfy the following properties:
1. loss functions $\ell_v(z, y)$ with $v \in \{-1, +1\}$ are convex and twice continuously differentiable with respect to $z \in \mathbb{R}$ for all fixed $y \in \mathcal{Y}$;
2. regularizers $\hat\Omega_v$ are uniformly strongly convex and twice continuously differentiable with respect to $w \in \mathcal{W}$ and $\dot x \in \phi(\mathcal{X})^n$, respectively;
3. action spaces $\mathcal{W}$ and $\phi(\mathcal{X})^n$ are non-empty, compact, and convex subsets of the finite-dimensional Euclidean spaces $\mathbb{R}^m$ and $\mathbb{R}^{m\cdot n}$, respectively.

Lemma 1 Under Assumption 2, at least one equilibrium point $(w^*, \dot x^*) \in \mathcal{W}\times\phi(\mathcal{X})^n$ of the Nash prediction game defined by

$$\min_{w} \hat\theta_{-1}(w, \dot x^*)\ \text{ s.t. } w \in \mathcal{W}, \qquad \min_{\dot x} \hat\theta_{+1}(w^*, \dot x)\ \text{ s.t. } \dot x \in \phi(\mathcal{X})^n \qquad (3)$$

exists.
Proof. Each player $v$'s cost function is a sum of loss terms resulting from loss function $\ell_v$ and regularizer $\hat\Omega_v$. By Assumption 2, these loss functions are convex and continuous, and the regularizers are uniformly strongly convex and continuous. Hence, both cost functions $\hat\theta_{-1}(w, \dot x)$ and $\hat\theta_{+1}(w, \dot x)$ are continuous in all arguments and uniformly strongly convex in $w \in \mathcal{W}$ and $\dot x \in \phi(\mathcal{X})^n$, respectively. As both action spaces $\mathcal{W}$ and $\phi(\mathcal{X})^n$ are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces, a Nash equilibrium exists; see Theorem 4.3 of Basar and Olsder (1999).

3.2 Uniqueness of the Nash Equilibrium

We will now derive conditions for the uniqueness of an equilibrium of the Nash prediction game defined in (3). We first reformulate the two-player game into an $(n+1)$-player game. In Lemma 2, we then present a sufficient condition for the uniqueness of the Nash equilibrium in this game, and by applying Proposition 4 and Lemmas 5-7 we verify whether this condition is met. Finally, we state the main result in Theorem 8: The Nash equilibrium is unique under certain properties of the loss functions, the regularizers, and the cost factors which all can be verified easily.

Taking into account the Cartesian product structure of the data generator's action space $\phi(\mathcal{X})^n$, it is not difficult to see that $(w^*, \dot x^*)$ with $\dot x^* = \left[\dot x_1^{*T}, \dots, \dot x_n^{*T}\right]^T$ and $\dot x_i^* := \phi(\dot x_i^*)$ is a solution of the two-player game if, and only if, $(w^*, \dot x_1^*, \dots, \dot x_n^*)$ is a Nash equilibrium of the $(n+1)$-player game defined by

$$\min_{w} \hat\theta_{-1}(w, \dot x)\ \text{ s.t. } w \in \mathcal{W}, \qquad \min_{\dot x_1} \hat\theta_{+1}(w, \dot x)\ \text{ s.t. } \dot x_1 \in \phi(\mathcal{X}), \qquad \cdots, \qquad \min_{\dot x_n} \hat\theta_{+1}(w, \dot x)\ \text{ s.t. } \dot x_n \in \phi(\mathcal{X}), \qquad (4)$$

which results from (3) by repeating the cost function $\hat\theta_{+1}$ $n$ times and minimizing this function with respect to $\dot x_i \in \phi(\mathcal{X})$ for $i = 1,\dots,n$. Then the pseudo-gradient (in the sense of Rosen, 1965) of the game in (4) is defined by

$$g_r(w, \dot x) := \begin{bmatrix} r_0\,\nabla_{w}\hat\theta_{-1}(w,\dot x) \\ r_1\,\nabla_{\dot x_1}\hat\theta_{+1}(w,\dot x) \\ r_2\,\nabla_{\dot x_2}\hat\theta_{+1}(w,\dot x) \\ \vdots \\ r_n\,\nabla_{\dot x_n}\hat\theta_{+1}(w,\dot x) \end{bmatrix} \in \mathbb{R}^{m+m\cdot n}, \qquad (5)$$

with any fixed vector $r = [r_0, r_1, \dots, r_n]^T$ where $r_i > 0$ for $i = 0,\dots,n$. The derivative of $g_r$, that is, the pseudo-Jacobian of (4), is given by

$$J_r(w,\dot x) = \Lambda_r \begin{bmatrix} \nabla^2_{w,w}\hat\theta_{-1}(w,\dot x) & \nabla^2_{w,\dot x}\hat\theta_{-1}(w,\dot x) \\ \nabla^2_{\dot x,w}\hat\theta_{+1}(w,\dot x) & \nabla^2_{\dot x,\dot x}\hat\theta_{+1}(w,\dot x) \end{bmatrix}, \qquad (6)$$

where

$$\Lambda_r := \begin{bmatrix} r_0 I_m & 0 & \cdots & 0 \\ 0 & r_1 I_m & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & r_n I_m \end{bmatrix} \in \mathbb{R}^{(m+m\cdot n)\times(m+m\cdot n)}. \qquad (7)$$

Note that the pseudo-gradient $g_r$ and the pseudo-Jacobian $J_r$ exist when Assumption 2 is satisfied. The above definition of the pseudo-Jacobian enables us to state the following result about the uniqueness of a Nash equilibrium.

Lemma 2 Let Assumption 2 hold and suppose there exists a fixed vector $r = [r_0, r_1, \dots, r_n]^T$ with $r_i > 0$ for all $i = 0,1,\dots,n$ such that the corresponding pseudo-Jacobian $J_r(w, \dot x)$ is positive definite for all $(w, \dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$. Then the Nash prediction game in (3) has a unique equilibrium.

Proof. The existence of a Nash equilibrium follows from Lemma 1. Recall from our previous discussion that the original Nash game in (3) has a unique solution if, and only if, the game from (4) with one learner and $n$ data generators admits a unique solution. In view of Theorem 2 of Rosen (1965), the latter attains a unique solution if the pseudo-gradient $g_r$ is strictly monotone; that is, if for all distinct pairs of actions $w, w' \in \mathcal{W}$ and $\dot x, \dot x' \in \phi(\mathcal{X})^n$, the inequality

$$\left(g_r(w,\dot x) - g_r(w',\dot x')\right)^T \begin{bmatrix} w - w' \\ \dot x - \dot x' \end{bmatrix} > 0$$

holds. A sufficient condition for this pseudo-gradient being strictly monotone is the positive definiteness of the pseudo-Jacobian $J_r$ (see, e.g., Theorem 7.11 of Geiger and Kanzow, 1999, and Theorem 6 of Rosen, 1965, respectively).
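Numerically, the positive-definiteness condition of Lemma 2 can be tested for a given (generally non-symmetric) pseudo-Jacobian by inspecting the eigenvalues of its symmetric part, since $z^T J z = z^T \frac{1}{2}(J + J^T) z$. The following short sketch is an illustrative numerical aid and not part of the formal derivation.

import numpy as np

def is_positive_definite(J, tol=1e-10):
    # a real square matrix J satisfies z^T J z > 0 for all z != 0
    # iff its symmetric part (J + J^T)/2 has only positive eigenvalues
    H = 0.5 * (J + J.T)
    return np.linalg.eigvalsh(H).min() > tol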
To verify whether the positive definiteness condition of Lemma 2 is satisfied, we first derive the pseudo-Jacobian $J_r(w,\dot x)$. We subsequently decompose it into a sum of three matrices and analyze the definiteness of these matrices for the particular choice of vector $r$ with $r_0 := 1$, $r_i := \frac{c_{-1,i}}{c_{+1,i}} > 0$ for all $i = 1,\dots,n$, with corresponding matrix

$$\Lambda_r := \begin{bmatrix} I_m & 0 & \cdots & 0 \\ 0 & \frac{c_{-1,1}}{c_{+1,1}} I_m & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{c_{-1,n}}{c_{+1,n}} I_m \end{bmatrix}. \qquad (8)$$

This finally provides us with sufficient conditions which ensure the uniqueness of the Nash equilibrium.

3.2.1 Derivation of the Pseudo-Jacobian

Throughout this section, we denote by $\ell'_v(z, y)$ and $\ell''_v(z, y)$ the first and second derivative of the mapping $\ell_v(z, y)$ with respect to $z \in \mathbb{R}$ and use the abbreviations

$$\ell'_{v,i} := \ell'_v(\dot x_i^T w, y_i), \qquad \ell''_{v,i} := \ell''_v(\dot x_i^T w, y_i),$$

for both players $v \in \{-1, +1\}$ and $i = 1,\dots,n$. To state the pseudo-Jacobian for the empirical costs given in (1) and (2), we first derive their first-order partial derivatives,

$$\nabla_{w}\hat\theta_{-1}(w,\dot x) = \sum_{i=1}^{n} c_{-1,i}\,\ell'_{-1,i}\,\dot x_i + \rho_{-1}\nabla_{w}\hat\Omega_{-1}(w), \qquad (9)$$

$$\nabla_{\dot x_i}\hat\theta_{+1}(w,\dot x) = c_{+1,i}\,\ell'_{+1,i}\,w + \rho_{+1}\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x). \qquad (10)$$

This allows us to calculate the entries of the pseudo-Jacobian given in (6),

$$\nabla^2_{w,w}\hat\theta_{-1}(w,\dot x) = \sum_{i=1}^{n} c_{-1,i}\,\ell''_{-1,i}\,\dot x_i\dot x_i^T + \rho_{-1}\nabla^2_{w,w}\hat\Omega_{-1}(w),$$
$$\nabla^2_{w,\dot x_i}\hat\theta_{-1}(w,\dot x) = c_{-1,i}\,\ell''_{-1,i}\,\dot x_i w^T + c_{-1,i}\,\ell'_{-1,i}\,I_m,$$
$$\nabla^2_{\dot x_i,w}\hat\theta_{+1}(w,\dot x) = c_{+1,i}\,\ell''_{+1,i}\,w\dot x_i^T + c_{+1,i}\,\ell'_{+1,i}\,I_m,$$
$$\nabla^2_{\dot x_i,\dot x_j}\hat\theta_{+1}(w,\dot x) = \delta_{ij}\,c_{+1,i}\,\ell''_{+1,i}\,ww^T + \rho_{+1}\nabla^2_{\dot x_i,\dot x_j}\hat\Omega_{+1}(x,\dot x),$$

where $\delta_{ij}$ denotes Kronecker's delta which equals 1 if $i = j$ and 0 otherwise.
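The first-order partial derivatives (9) and (10) are the building blocks of the pseudo-gradient $g_r$ in (5) that both solvers of Section 4 evaluate repeatedly. A minimal sketch, assuming the logistic-type losses and the squared-norm and squared-distance regularizers that will be instantiated in Section 5 (these concrete choices are assumptions of the illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pseudo_gradient(w, X_dot, X, y, c_neg, c_pos, rho_neg, rho_pos):
    """Pseudo-gradient g_r(w, x_dot) from (5), (9), (10) with r_0 = 1 and
    r_i = c_neg[i] / c_pos[i], for illustrative logistic-type losses."""
    z = X_dot @ w
    dl_neg = -y * sigmoid(-y * z)     # ell'_{-1,i} for log(1 + exp(-y z))
    dl_pos = sigmoid(z)               # ell'_{+1,i} for log(1 + exp(z))
    # (9): gradient of the learner's empirical costs with respect to w
    grad_w = X_dot.T @ (c_neg * dl_neg) + rho_neg * w
    # (10): gradients of the data generator's costs with respect to each x_dot_i,
    # using the squared-distance regularizer whose gradient is (x_dot_i - x_i)/n
    n = X_dot.shape[0]
    grad_xdot = (c_pos * dl_pos)[:, None] * w[None, :] + rho_pos / n * (X_dot - X)
    r = c_neg / c_pos                 # weights r_i = c_{-1,i} / c_{+1,i}
    return np.concatenate([grad_w, (r[:, None] * grad_xdot).ravel()])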
We can express these equations more compactly as matrix equations. Therefore, we use $\Lambda_r$ as defined in (7) and set the diagonal matrix $\Gamma_v := \operatorname{diag}(c_{v,1}\ell''_{v,1}, \dots, c_{v,n}\ell''_{v,n})$. Additionally, we define $\dot X \in \mathbb{R}^{n\times m}$ as the matrix with rows $\dot x_1^T, \dots, \dot x_n^T$, and $n$ matrices $W_i \in \mathbb{R}^{n\times m}$ with all entries set to zero except for the $i$-th row, which is set to $w^T$. Then,

$$\nabla^2_{w,w}\hat\theta_{-1}(w,\dot x) = \dot X^T \Gamma_{-1}\dot X + \rho_{-1}\nabla^2_{w,w}\hat\Omega_{-1}(w),$$
$$\nabla^2_{w,\dot x_i}\hat\theta_{-1}(w,\dot x) = \dot X^T \Gamma_{-1} W_i + c_{-1,i}\,\ell'_{-1,i}\,I_m,$$
$$\nabla^2_{\dot x_i,w}\hat\theta_{+1}(w,\dot x) = W_i^T \Gamma_{+1}\dot X + c_{+1,i}\,\ell'_{+1,i}\,I_m,$$
$$\nabla^2_{\dot x_i,\dot x_j}\hat\theta_{+1}(w,\dot x) = W_i^T \Gamma_{+1} W_j + \rho_{+1}\nabla^2_{\dot x_i,\dot x_j}\hat\Omega_{+1}(x,\dot x).$$

Hence, the pseudo-Jacobian in (6) can be stated as follows,

$$J_r(w,\dot x) = \Lambda_r \begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T \begin{bmatrix} \Gamma_{-1} & \Gamma_{-1} \\ \Gamma_{+1} & \Gamma_{+1} \end{bmatrix} \begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix} + \Lambda_r \begin{bmatrix} \rho_{-1}\nabla^2_{w,w}\hat\Omega_{-1}(w) & c_{-1,1}\ell'_{-1,1} I_m & \cdots & c_{-1,n}\ell'_{-1,n} I_m \\ c_{+1,1}\ell'_{+1,1} I_m & \rho_{+1}\nabla^2_{\dot x_1,\dot x_1}\hat\Omega_{+1}(x,\dot x) & \cdots & \rho_{+1}\nabla^2_{\dot x_1,\dot x_n}\hat\Omega_{+1}(x,\dot x) \\ \vdots & \vdots & \ddots & \vdots \\ c_{+1,n}\ell'_{+1,n} I_m & \rho_{+1}\nabla^2_{\dot x_n,\dot x_1}\hat\Omega_{+1}(x,\dot x) & \cdots & \rho_{+1}\nabla^2_{\dot x_n,\dot x_n}\hat\Omega_{+1}(x,\dot x) \end{bmatrix}.$$

We now aim at decomposing the right-hand expression in order to verify the definiteness of the pseudo-Jacobian.

3.2.2 Decomposition of the Pseudo-Jacobian

To verify the positive definiteness of the pseudo-Jacobian, we further decompose the second summand of the above expression into a positive semi-definite and a strictly positive definite matrix. Therefore, let us denote the smallest eigenvalues of the Hessians of the regularizers on the corresponding action spaces $\mathcal{W}$ and $\phi(\mathcal{X})^n$ by

$$\lambda_{-1} := \inf_{w\in\mathcal{W}} \lambda_{\min}\!\left(\nabla^2_{w,w}\hat\Omega_{-1}(w)\right), \qquad (11)$$
$$\lambda_{+1} := \inf_{\dot x\in\phi(\mathcal{X})^n} \lambda_{\min}\!\left(\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x)\right), \qquad (12)$$

where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of the symmetric matrix $A$.

Remark 3 Note that the minimum in (11) and (12) is attained and is strictly positive: The mapping $\lambda_{\min} : M^{k\times k} \rightarrow \mathbb{R}$ is concave on the set $M^{k\times k}$ of symmetric matrices of dimension $k\times k$ (cf. Example 3.10 in Boyd and Vandenberghe, 2004), and in particular, it therefore follows that this mapping is continuous. Furthermore, the mappings $u_{-1} : \mathcal{W} \rightarrow M^{m\times m}$ with $u_{-1}(w) := \nabla^2_{w,w}\hat\Omega_{-1}(w)$ and $u_{+1} : \phi(\mathcal{X})^n \rightarrow M^{m\cdot n\times m\cdot n}$ with $u_{+1}(\dot x) := \nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x)$ are continuous (for any fixed $x$) by Assumption 2. Hence, the mappings $w \mapsto \lambda_{\min}(u_{-1}(w))$ and $\dot x \mapsto \lambda_{\min}(u_{+1}(\dot x))$ are also continuous since each is precisely the composition $\lambda_{\min}\circ u_v$ of the continuous functions $\lambda_{\min}$ and $u_v$ for $v \in \{-1,+1\}$. Taking into account that a continuous mapping on a non-empty compact set attains its minimum, it follows that there exist elements $w \in \mathcal{W}$ and $\dot x \in \phi(\mathcal{X})^n$ such that

$$\lambda_{-1} = \lambda_{\min}\!\left(\nabla^2_{w,w}\hat\Omega_{-1}(w)\right), \qquad \lambda_{+1} = \lambda_{\min}\!\left(\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x)\right).$$

Moreover, since the Hessians of the regularizers are positive definite by Assumption 2, we see that $\lambda_v > 0$ holds for $v \in \{-1,+1\}$.
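For the quadratic regularizers that will be used in Section 5 the Hessians are constant, so $\lambda_{-1}$ and $\lambda_{+1}$ can be written down directly; for other regularizers they can be obtained numerically from the Hessian. The following brief sketch illustrates both routes, assuming those Section-5 regularizers for the closed-form case.

import numpy as np

def lambda_min_l2_regularizers(m, n):
    # Hessian of Omega_hat_{-1}(w) = 0.5 ||w||^2 is I_m, smallest eigenvalue 1;
    # Hessian of Omega_hat_{+1} = (1/n) sum_i 0.5 ||x_dot_i - x_i||^2 is (1/n) I_{m n}
    return 1.0, 1.0 / n

def lambda_min_numeric(hessian):
    # generic route: smallest eigenvalue of a symmetric Hessian matrix
    return np.linalg.eigvalsh(hessian).min()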
By the above definitions, we can decompose the regularizers' Hessians as follows,

$$\nabla^2_{w,w}\hat\Omega_{-1}(w) = \lambda_{-1} I_m + \left(\nabla^2_{w,w}\hat\Omega_{-1}(w) - \lambda_{-1} I_m\right),$$
$$\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) = \lambda_{+1} I_{m\cdot n} + \left(\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) - \lambda_{+1} I_{m\cdot n}\right).$$

As the regularizers are strictly convex, the $\lambda_v$ are positive, so that for each of the above equations the first summand is positive definite and the second summand is positive semi-definite.

Proposition 4 The pseudo-Jacobian has the representation

$$J_r(w,\dot x) = J_r^{(1)}(w,\dot x) + J_r^{(2)}(w,\dot x) + J_r^{(3)}(w,\dot x) \qquad (13)$$

where

$$J_r^{(1)}(w,\dot x) = \Lambda_r \begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T \begin{bmatrix} \Gamma_{-1} & \Gamma_{-1} \\ \Gamma_{+1} & \Gamma_{+1} \end{bmatrix} \begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix},$$

$$J_r^{(2)}(w,\dot x) = \Lambda_r \begin{bmatrix} \rho_{-1}\lambda_{-1} I_m & c_{-1,1}\ell'_{-1,1} I_m & \cdots & c_{-1,n}\ell'_{-1,n} I_m \\ c_{+1,1}\ell'_{+1,1} I_m & \rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ c_{+1,n}\ell'_{+1,n} I_m & 0 & \cdots & \rho_{+1}\lambda_{+1} I_m \end{bmatrix},$$

$$J_r^{(3)}(w,\dot x) = \Lambda_r \begin{bmatrix} \rho_{-1}\!\left(\nabla^2_{w,w}\hat\Omega_{-1}(w) - \lambda_{-1} I_m\right) & 0 \\ 0 & \rho_{+1}\!\left(\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) - \lambda_{+1} I_{m\cdot n}\right) \end{bmatrix}.$$

The above proposition restates the pseudo-Jacobian as a sum of the three matrices $J_r^{(1)}(w,\dot x)$, $J_r^{(2)}(w,\dot x)$, and $J_r^{(3)}(w,\dot x)$. Matrix $J_r^{(1)}(w,\dot x)$ contains all $\ell''_{v,i}$ terms, $J_r^{(2)}(w,\dot x)$ is a composition of scaled identity matrices, and $J_r^{(3)}(w,\dot x)$ contains the Hessians of the regularizers where the diagonal entries are reduced by $\rho_{-1}\lambda_{-1}$ and $\rho_{+1}\lambda_{+1}$, respectively. We further analyze these matrices in the following section.

3.2.3 Definiteness of the Summands of the Pseudo-Jacobian

Recall that we want to investigate whether the pseudo-Jacobian $J_r(w,\dot x)$ is positive definite for each pair of actions $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$. A sufficient condition is that $J_r^{(1)}(w,\dot x)$, $J_r^{(2)}(w,\dot x)$, and $J_r^{(3)}(w,\dot x)$ are positive semi-definite and at least one of these matrices is positive definite. From the definition of $\lambda_v$, it becomes apparent that $J_r^{(3)}(w,\dot x)$ is positive semi-definite. In addition, $J_r^{(2)}(w,\dot x)$ obviously becomes positive definite for sufficiently large $\rho_v$ as, in this case, the main diagonal dominates the non-diagonal entries. Finally, $J_r^{(1)}(w,\dot x)$ becomes positive semi-definite under some mild conditions on the loss functions. In the following, we derive these conditions, state lower bounds on the regularization parameters $\rho_v$, and provide formal proofs of the above claims. Therefore, we make the following assumptions on the loss functions $\ell_v$ and the regularizers $\hat\Omega_v$ for $v \in \{-1,+1\}$. Instances of these functions satisfying Assumptions 2 and 3 will be given in Section 5. A discussion of the practical implications of these assumptions is given in the subsequent section.

Assumption 3 For all $w \in \mathcal{W}$ and $\dot x \in \phi(\mathcal{X})^n$ with $\dot x = \left[\dot x_1^T, \dots, \dot x_n^T\right]^T$ the following conditions are satisfied:
1. the second derivatives of the loss functions are equal for all $y \in \mathcal{Y}$ and $i = 1,\dots,n$,
$$\ell''_{-1}(f_w(\dot x_i), y) = \ell''_{+1}(f_w(\dot x_i), y);$$
2. the players' regularization parameters satisfy
$$\rho_{-1}\rho_{+1} > \frac{\tau^2}{\lambda_{-1}\lambda_{+1}}\, c_{-1}^T c_{+1},$$
where $\lambda_{-1}, \lambda_{+1}$ are the smallest eigenvalues of the Hessians of the regularizers specified in (11) and (12), $c_v = [c_{v,1}, c_{v,2}, \dots, c_{v,n}]^T$, and
$$\tau = \sup_{(x,y)\in\phi(\mathcal{X})\times\mathcal{Y}} \frac{1}{2}\left|\ell'_{-1}(f_w(x), y) + \ell'_{+1}(f_w(x), y)\right|; \qquad (14)$$
3. for all $i = 1,\dots,n$, either both players have equal instance-specific cost factors, $c_{-1,i} = c_{+1,i}$, or the partial derivative $\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x)$ of the data generator's regularizer is independent of $\dot x_j$ for all $j \ne i$.

Notice that $\tau$ in Equation 14 can be chosen to be finite as the set $\phi(\mathcal{X})\times\mathcal{Y}$ is assumed to be compact, and consequently, the values of both continuous mappings $\ell'_{-1}(f_w(x), y)$ and $\ell'_{+1}(f_w(x), y)$ are finite for all $(x, y) \in \phi(\mathcal{X})\times\mathcal{Y}$.
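The second condition of Assumption 3 can be tested numerically once losses, cost factors, and regularizers are fixed. The sketch below is illustrative: it estimates $\tau$ by taking the supremum in (14) over the transformed training instances only, rather than over all of $\phi(\mathcal{X})\times\mathcal{Y}$, and assumes the logistic-type losses of the earlier sketches.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def check_assumption_3_2(w, X_dot, y, c_neg, c_pos,
                         rho_neg, rho_pos, lam_neg, lam_pos):
    """Approximate check of rho_{-1} rho_{+1} lambda_{-1} lambda_{+1} > tau^2 c_{-1}^T c_{+1}.

    tau is estimated over the training sample; the exact condition takes the
    supremum over the whole compact set phi(X) x Y."""
    z = X_dot @ w
    dl_neg = -y * sigmoid(-y * z)       # ell'_{-1}(f_w(x), y)
    dl_pos = sigmoid(z)                 # ell'_{+1}(f_w(x), y)
    tau = 0.5 * np.max(np.abs(dl_neg + dl_pos))
    lhs = rho_neg * rho_pos * lam_neg * lam_pos
    rhs = tau ** 2 * float(c_neg @ c_pos)
    return lhs > rhs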
Lemma 5 Let $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J_r^{(1)}(w,\dot x)$ is symmetric positive semi-definite (but not positive definite) for $\Lambda_r$ defined as in Equation 8.

Proof. The special structure of $\Lambda_r$, $\dot X$, and $W_i$ gives

$$J_r^{(1)}(w,\dot x) = \begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T \begin{bmatrix} r_0\Gamma_{-1} & r_0\Gamma_{-1} \\ \Delta\Gamma_{+1} & \Delta\Gamma_{+1} \end{bmatrix} \begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}$$

with $\Delta := \operatorname{diag}(r_1, \dots, r_n)$. From the assumption $\ell''_{-1,i} = \ell''_{+1,i}$ and the definition $r_0 = 1$, $r_i = \frac{c_{-1,i}}{c_{+1,i}} > 0$, it follows that $\Gamma_{-1} = \Delta\Gamma_{+1}$, such that

$$J_r^{(1)}(w,\dot x) = \begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T \begin{bmatrix} \Gamma_{-1} & \Gamma_{-1} \\ \Gamma_{-1} & \Gamma_{-1} \end{bmatrix} \begin{bmatrix} \dot X & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix},$$

which is obviously a symmetric matrix. Furthermore, we show that $z^T J_r^{(1)}(w,\dot x)\, z \ge 0$ holds for all vectors $z \in \mathbb{R}^{m+m\cdot n}$. To this end, let $z$ be arbitrarily given, and partition this vector into $z = \left[z_0^T, z_1^T, \dots, z_n^T\right]^T$ with $z_i \in \mathbb{R}^m$ for all $i = 0,1,\dots,n$. Then a simple calculation shows that

$$z^T J_r^{(1)}(w,\dot x)\, z = \sum_{i=1}^{n} c_{-1,i}\,\ell''_{-1,i}\,\bigl(z_0^T\dot x_i + z_i^T w\bigr)^2 \ge 0,$$

since $\ell''_{-1,i} \ge 0$ for all $i = 1,\dots,n$ in view of the assumed convexity of the mapping $\ell_{-1}(z, y)$. Hence, $J_r^{(1)}(w,\dot x)$ is positive semi-definite. This matrix cannot be positive definite since we have $z^T J_r^{(1)}(w,\dot x)\, z = 0$ for the particular vector $z$ defined by $z_0 := -w$ and $z_i := \dot x_i$ for all $i = 1,\dots,n$.

Lemma 6 Let $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J_r^{(2)}(w,\dot x)$ is positive definite for $\Lambda_r$ defined as in Equation 8.

Proof. A sufficient and necessary condition for the (possibly asymmetric) matrix $J_r^{(2)}(w,\dot x)$ to be positive definite is that the Hermitian matrix $H(w,\dot x) := J_r^{(2)}(w,\dot x) + J_r^{(2)}(w,\dot x)^T$ is positive definite, that is, all eigenvalues of $H(w,\dot x)$ are positive. Let $\Lambda_r^{\frac12}$ denote the square root of $\Lambda_r$, which is defined in such a way that the diagonal elements of $\Lambda_r^{\frac12}$ are the square roots of the corresponding diagonal elements of $\Lambda_r$. Furthermore, we denote by $\Lambda_r^{-\frac12}$ the inverse of $\Lambda_r^{\frac12}$. Then, by Sylvester's law of inertia, the matrix $\bar H(w,\dot x) := \Lambda_r^{-\frac12} H(w,\dot x)\, \Lambda_r^{-\frac12}$ has the same number of positive, zero, and negative eigenvalues as the matrix $H(w,\dot x)$ itself. Hence, $J_r^{(2)}(w,\dot x)$ is positive definite if, and only if, all eigenvalues of

$$\bar H(w,\dot x) = \Lambda_r^{-\frac12}\!\left(J_r^{(2)}(w,\dot x) + J_r^{(2)}(w,\dot x)^T\right)\!\Lambda_r^{-\frac12} = \begin{bmatrix} 2\rho_{-1}\lambda_{-1} I_m & \tilde c_1 I_m & \cdots & \tilde c_n I_m \\ \tilde c_1 I_m & 2\rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \tilde c_n I_m & 0 & \cdots & 2\rho_{+1}\lambda_{+1} I_m \end{bmatrix}$$

are positive, where $\tilde c_i := \sqrt{c_{-1,i}\,c_{+1,i}}\,\bigl(\ell'_{-1,i} + \ell'_{+1,i}\bigr)$. Each eigenvalue $\lambda$ of this matrix satisfies $\bigl(\bar H(w,\dot x) - \lambda I_{m+m\cdot n}\bigr)v = 0$ for the corresponding eigenvector $v = \left[v_0^T, v_1^T, \dots, v_n^T\right]^T$ with $v_i \in \mathbb{R}^m$ for $i = 0,1,\dots,n$. This eigenvalue equation can be rewritten block-wise as

$$(2\rho_{-1}\lambda_{-1} - \lambda)\,v_0 + \sum_{i=1}^{n} \tilde c_i v_i = 0, \qquad (15)$$
$$(2\rho_{+1}\lambda_{+1} - \lambda)\,v_i + \tilde c_i v_0 = 0 \quad \forall\, i = 1,\dots,n. \qquad (16)$$

To compute all possible eigenvalues, we consider two cases. First, assume that $v_0 = 0$. Then (15) and (16) reduce to

$$\sum_{i=1}^{n} \tilde c_i v_i = 0 \quad\text{and}\quad (2\rho_{+1}\lambda_{+1} - \lambda)\,v_i = 0 \quad \forall\, i = 1,\dots,n.$$

Since $v_0 = 0$ and eigenvector $v \ne 0$, at least one $v_i$ is non-zero. This implies that $\lambda = 2\rho_{+1}\lambda_{+1}$ is an eigenvalue.
Using the fact that the null space of the linear mapping $v \mapsto \sum_{i=1}^{n} \tilde c_i v_i$ has dimension $(n-1)\cdot m$ (we have $n\cdot m$ degrees of freedom counting all components of $v_1, \dots, v_n$ and $m$ equations in $\sum_{i=1}^{n} \tilde c_i v_i = 0$), it follows that $\lambda = 2\rho_{+1}\lambda_{+1}$ is an eigenvalue of multiplicity $(n-1)\cdot m$.

Now we consider the second case where $v_0 \ne 0$. We may further assume that $\lambda \ne 2\rho_{+1}\lambda_{+1}$ (since otherwise we get the same eigenvalue as before, just with a different multiplicity). We then get from (16) that

$$v_i = -\frac{\tilde c_i}{2\rho_{+1}\lambda_{+1} - \lambda}\, v_0 \quad \forall\, i = 1,\dots,n, \qquad (17)$$

and when substituting this expression into (15), we obtain

$$\left((2\rho_{-1}\lambda_{-1} - \lambda) - \sum_{i=1}^{n}\frac{\tilde c_i^2}{2\rho_{+1}\lambda_{+1} - \lambda}\right) v_0 = 0.$$

Taking into account that $v_0 \ne 0$, this implies

$$0 = 2\rho_{-1}\lambda_{-1} - \lambda - \frac{1}{2\rho_{+1}\lambda_{+1} - \lambda}\sum_{i=1}^{n}\tilde c_i^2$$

and, therefore,

$$0 = \lambda^2 - 2(\rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1})\lambda + 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1} - \sum_{i=1}^{n}\tilde c_i^2.$$

The roots of this quadratic equation are

$$\lambda = \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} \pm \sqrt{(\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + \sum_{i=1}^{n}\tilde c_i^2}, \qquad (18)$$

and these are the remaining eigenvalues of $\bar H(w,\dot x)$, each of multiplicity $m$, since there are precisely $m$ linearly independent vectors $v_0 \ne 0$ whereas the other vectors $v_i$ ($i = 1,\dots,n$) are uniquely defined by (17) in this case. In particular, this implies that the dimensions of all three eigenspaces together is $(n-1)m + m + m = (n+1)m$; hence other eigenvalues cannot exist. Since the eigenvalue $\lambda = 2\rho_{+1}\lambda_{+1}$ is positive by Remark 3, it remains to show that the roots in (18) are positive as well. By Assumption 3, we have

$$\sum_{i=1}^{n}\tilde c_i^2 = \sum_{i=1}^{n} c_{-1,i}\,c_{+1,i}\,\bigl(\ell'_{-1,i} + \ell'_{+1,i}\bigr)^2 \le 4\tau^2\, c_{-1}^T c_{+1} < 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1},$$

where $c_v = [c_{v,1}, c_{v,2}, \dots, c_{v,n}]^T$. This inequality and Equation 18 give

$$\lambda = \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} \pm \sqrt{(\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + \sum_{i=1}^{n}\tilde c_i^2} > \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} - \sqrt{(\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1}} = 0.$$

As all eigenvalues of $\bar H(w,\dot x)$ are positive, the matrix $H(w,\dot x)$ and, consequently, also the matrix $J_r^{(2)}(w,\dot x)$ are positive definite.

Lemma 7 Let $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J_r^{(3)}(w,\dot x)$ is positive semi-definite for $\Lambda_r$ defined as in Equation 8.

Proof. By Assumption 3, either both players have equal instance-specific costs, or the partial gradient $\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x)$ of the data generator's regularizer is independent of $\dot x_j$ for all $j \ne i$ and $i = 1,\dots,n$. Let us consider the first case where $c_{-1,i} = c_{+1,i}$, and consequently $r_i = 1$, for all $i = 1,\dots,n$, such that

$$J_r^{(3)}(w,\dot x) = \begin{bmatrix} \rho_{-1}\!\left(\nabla^2_{w,w}\hat\Omega_{-1}(w) - \lambda_{-1} I_m\right) & 0 \\ 0 & \rho_{+1}\!\left(\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) - \lambda_{+1} I_{m\cdot n}\right) \end{bmatrix}.$$

The eigenvalues of this block diagonal matrix are the eigenvalues of $\rho_{-1}(\nabla^2_{w,w}\hat\Omega_{-1}(w) - \lambda_{-1} I_m)$ together with those of $\rho_{+1}(\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x) - \lambda_{+1} I_{m\cdot n})$. From the definition of $\lambda_v$ in (11) and (12) it follows that these matrices are positive semi-definite for $v \in \{-1,+1\}$. Hence, $J_r^{(3)}(w,\dot x)$ is positive semi-definite as well.

Now, let us consider the second case where we assume that $\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x)$ is independent of $\dot x_j$ for all $j \ne i$. Hence, $\nabla^2_{\dot x_i,\dot x_j}\hat\Omega_{+1}(x,\dot x) = 0$ for all $j \ne i$, such that

$$J_r^{(3)}(w,\dot x) = \begin{bmatrix} \rho_{-1}\tilde\Omega_{-1} & 0 & \cdots & 0 \\ 0 & \rho_{+1}\frac{c_{-1,1}}{c_{+1,1}}\tilde\Omega_{+1,1} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \rho_{+1}\frac{c_{-1,n}}{c_{+1,n}}\tilde\Omega_{+1,n} \end{bmatrix},$$
where $\tilde\Omega_{-1} := \nabla^2_{w,w}\hat\Omega_{-1}(w) - \lambda_{-1} I_m$ and $\tilde\Omega_{+1,i} := \nabla^2_{\dot x_i,\dot x_i}\hat\Omega_{+1}(x,\dot x) - \lambda_{+1} I_m$. The eigenvalues of this block diagonal matrix are again the union of the eigenvalues of the single blocks $\rho_{-1}\tilde\Omega_{-1}$ and $\rho_{+1}\frac{c_{-1,i}}{c_{+1,i}}\tilde\Omega_{+1,i}$ for $i = 1,\dots,n$. As in the first part of the proof, $\tilde\Omega_{-1}$ is positive semi-definite. The eigenvalues of $\nabla^2_{\dot x,\dot x}\hat\Omega_{+1}(x,\dot x)$ are the union of all eigenvalues of the $\nabla^2_{\dot x_i,\dot x_i}\hat\Omega_{+1}(x,\dot x)$. Hence, each of these eigenvalues is larger than or equal to $\lambda_{+1}$, and thus, each block $\tilde\Omega_{+1,i}$ is positive semi-definite. The factors $\rho_{-1} > 0$ and $\rho_{+1}\frac{c_{-1,i}}{c_{+1,i}} > 0$ are multipliers that do not affect the definiteness of the blocks, and consequently, $J_r^{(3)}(w,\dot x)$ is positive semi-definite as well.

The previous results guarantee the existence and uniqueness of a Nash equilibrium under the stated assumptions.

Theorem 8 Let Assumptions 2 and 3 hold. Then the Nash prediction game in (3) has a unique equilibrium.

Proof. The existence of an equilibrium of the Nash prediction game in (3) follows from Lemma 1. Proposition 4 and Lemmas 5 to 7 imply that there is a positive diagonal matrix $\Lambda_r$ such that $J_r(w,\dot x)$ is positive definite for all $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$. Hence, the uniqueness follows from Lemma 2.

3.2.4 Practical Implications of Assumptions 2 and 3

Theorem 8 guarantees the uniqueness of the equilibrium only if the cost functions of learner and data generator relate in a certain way that is defined by Assumption 3. In addition, each of the cost functions has to satisfy Assumption 2. This section discusses the practical implications of these assumptions.

The conditions of Assumption 2 impose rather technical limitations on the cost functions. The requirement of convexity is quite ordinary in the machine learning context. In addition, the loss function has to be twice continuously differentiable, which restricts the family of eligible loss functions. However, this condition can still be met easily; for instance, by smoothed versions of the hinge loss. The second requirement of uniformly strongly convex and twice continuously differentiable regularizers is, again, only a weak restriction in practice. These requirements are met by standard regularizers; they occur, for instance, in the optimization criteria of SVMs and logistic regression. The requirement of non-empty, compact, and convex action spaces may be a restriction when dealing with binary or multinomial attributes. However, relaxing the action spaces of the data generator would typically result in a strategy that is more defensive than would be optimal but still less defensive than a worst-case strategy.

The first condition of Assumption 3 requires the cost functions of learner and data generator to have the same curvature. This is a crucial restriction; if the cost functions differ arbitrarily, the Nash equilibrium may not be unique. The requirement of identical curvatures is met, for instance, if one player chooses a loss function $\ell(f_w(\dot x_i), y)$ which only depends on the term $y\, f_w(\dot x_i)$, such as the SVM's hinge loss or the logistic loss. In this case, the condition is met when the other player chooses the loss $\ell(-f_w(\dot x_i), y)$. This loss is in some sense the opposite of $\ell(f_w(\dot x_i), y)$ as it approaches zero when the other goes to infinity and vice versa. In this case, the cost functions may still be non-antagonistic because the players' cost functions may contain instance-specific cost factors $c_{v,i}$ that can be modeled independently for the players.
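For the logistic loss $\ell(z, y) = \log(1 + e^{-yz})$, the identical-curvature requirement for the pair $\ell(f_w(\dot x_i), y)$ and $\ell(-f_w(\dot x_i), y)$ can be verified directly, since the sign flip cancels twice in the second derivative. The following short sketch is an illustrative numerical check of this claim.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d2_logistic(z, y):
    # second derivative of ell(z, y) = log(1 + exp(-y z)) with respect to z
    return sigmoid(y * z) * sigmoid(-y * z)

def d2_flipped_logistic(z, y):
    # second derivative of ell(-z, y) with respect to z (chain rule: the sign cancels twice)
    return sigmoid(-y * z) * sigmoid(y * z)

# check Assumption 3.1 for this loss pair on a grid of decision values
z = np.linspace(-5.0, 5.0, 101)
for y in (-1, +1):
    assert np.allclose(d2_logistic(z, y), d2_flipped_logistic(z, y))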
The second part of Assumption 3 couples the degree of regularization of the players. If the data generator produces instances at application time that differ greatly from the instances at training time, then the learner is required to regularize strongly for a unique equilibrium to exist. If the distributions at training and application time are more similar, the equilibrium is unique for smaller values of the learner's regularization parameter. This requirement is in line with the intuition that when the training instances are a poor approximation of the distribution at application time, then imposing only weak regularization on the loss function will result in a poor model.

The final requirement of Assumption 3 is, again, rather a technical limitation. It states that the interdependencies between the players' instance-specific costs must be captured either by the regularizers, leading to a full Hessian, or by the cost factors. These cost factors of learner and data generator may differ arbitrarily if the gradient of the data generator's costs of transforming an instance $x_i$ into $\dot x_i$ is independent of all other instances $\dot x_j$ with $j \ne i$. This is met, for instance, by cost models that only depend on some measure of the distance between $x_i$ and $\dot x_i$.

4. Finding the Unique Nash Equilibrium

According to Theorem 8, a unique equilibrium of the Nash prediction game in (3) exists for suitable loss functions and regularizers. To find this equilibrium, we derive and study two distinct methods: The first is based on the Nikaido-Isoda function that is constructed such that a minimax solution of this function is an equilibrium of the Nash prediction game and vice versa. This problem is then solved by inexact linesearch. In the second approach, we reformulate the Nash prediction game into a variational inequality problem which is solved by a modified extragradient method.

The data generator's action of transforming the input distribution manifests in a concatenation of transformed training instances $\dot x_i := \phi(\dot x_i)$ for $i = 1,\dots,n$, mapped into the feature space, and the learner's action is to choose the weight vector $w \in \mathcal{W}$ of classifier $h(x) = \operatorname{sign} f_w(x)$ with linear decision function $f_w(x) = w^T\phi(x)$.

4.1 An Inexact Linesearch Approach

To solve for a Nash equilibrium, we again consider the game from (4) with one learner and $n$ data generators.
A solution of this game can be identified with the help of the weighted Nikaido-Isoda function (Equation 19). For any two combinations of actions $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$ and $(w',\dot x') \in \mathcal{W}\times\phi(\mathcal{X})^n$ with $\dot x = \left[\dot x_1^T, \dots, \dot x_n^T\right]^T$ and $\dot x' = \left[\dot x_1'^T, \dots, \dot x_n'^T\right]^T$, this function is the weighted sum of the relative cost savings that the $n+1$ players can enjoy by changing from strategy $w$ to $w'$ and $\dot x_i$ to $\dot x_i'$, respectively, while the other players continue to play according to $(w,\dot x)$; that is,

$$\mathcal{J}_r(w,\dot x,w',\dot x') := r_0\left(\hat\theta_{-1}(w,\dot x) - \hat\theta_{-1}(w',\dot x)\right) + \sum_{i=1}^{n} r_i\left(\hat\theta_{+1}(w,\dot x) - \hat\theta_{+1}(w,\dot x_{(i)})\right), \qquad (19)$$

where $\dot x_{(i)} := \left[\dot x_1^T, \dots, \dot x_i'^T, \dots, \dot x_n^T\right]^T$. Let us denote the weighted sum of greatest possible cost savings with respect to any given combination of actions $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$ by

$$\bar{\mathcal{J}}_r(w,\dot x) := \max_{(w',\dot x') \in \mathcal{W}\times\phi(\mathcal{X})^n} \mathcal{J}_r(w,\dot x,w',\dot x'), \qquad (20)$$

where $\bar w(w,\dot x)$, $\bar x(w,\dot x)$ denote the corresponding pair of maximizers. Note that the maximum in (20) is attained for any $(w,\dot x)$, since $\mathcal{W}\times\phi(\mathcal{X})^n$ is assumed to be compact and $\mathcal{J}_r(w,\dot x,w',\dot x')$ is continuous in $(w',\dot x')$.

By these definitions, a combination $(w^*,\dot x^*)$ is an equilibrium of the Nash prediction game if, and only if, $(w^*,\dot x^*)$ is a global minimum of the mapping $\bar{\mathcal{J}}_r$ with $\bar{\mathcal{J}}_r(w^*,\dot x^*) = 0$ for any fixed weights $r_i > 0$ and $i = 0,\dots,n$; see Proposition 2.1 (b) of von Heusinger and Kanzow (2009). Equivalently, a Nash equilibrium simultaneously satisfies both equations $\bar w(w^*,\dot x^*) = w^*$ and $\bar x(w^*,\dot x^*) = \dot x^*$.

The significance of this observation is that the equilibrium problem in (3) can be reformulated into a minimization problem of the continuous mapping $\bar{\mathcal{J}}_r(w,\dot x)$. To solve this minimization problem, we make use of Corollary 3.4 of von Heusinger and Kanzow (2009). We set the weights $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\dots,n$ as in (8), which ensures the main condition of Corollary 3.4; that is, the positive definiteness of the Jacobian $J_r(w,\dot x)$ in (13) (cf. proof of Theorem 8). According to this corollary, the vectors

$$d_{-1}(w,\dot x) := \bar w(w,\dot x) - w \quad\text{and}\quad d_{+1}(w,\dot x) := \bar x(w,\dot x) - \dot x$$

form a descent direction $d(w,\dot x) := \left[d_{-1}(w,\dot x)^T, d_{+1}(w,\dot x)^T\right]^T$ of $\bar{\mathcal{J}}_r(w,\dot x)$ at any position $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$ (except for the Nash equilibrium where $d(w^*,\dot x^*) = 0$), and consequently, there exists $t \in [0,1]$ such that

$$\bar{\mathcal{J}}_r\bigl(w + t\,d_{-1}(w,\dot x),\ \dot x + t\,d_{+1}(w,\dot x)\bigr) < \bar{\mathcal{J}}_r(w,\dot x).$$

Since $(w,\dot x)$ and $(\bar w(w,\dot x), \bar x(w,\dot x))$ are feasible combinations of actions, the convexity of the action spaces ensures that $(w + t\,d_{-1}(w,\dot x), \dot x + t\,d_{+1}(w,\dot x))$ is a feasible combination for any $t \in [0,1]$ as well. The following algorithm exploits these properties.

Algorithm 1 ILS: Inexact Linesearch Solver for Nash Prediction Games
Require: Cost functions $\hat\theta_v$ as defined in (1) and (2), and action spaces $\mathcal{W}$ and $\phi(\mathcal{X})^n$.
1: Select initial $w^{(0)} \in \mathcal{W}$, set $\dot x^{(0)} := x$, set $k := 0$, and select $\sigma \in (0,1)$ and $\beta \in (0,1)$.
2: Set $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\dots,n$.
3: repeat
4: Set $d_{-1}^{(k)} := \bar w^{(k)} - w^{(k)}$ where $\bar w^{(k)} := \arg\max_{w' \in \mathcal{W}} \mathcal{J}_r\bigl(w^{(k)}, \dot x^{(k)}, w', \dot x^{(k)}\bigr)$.
5: Set $d_{+1}^{(k)} := \bar x^{(k)} - \dot x^{(k)}$ where $\bar x^{(k)} := \arg\max_{\dot x' \in \phi(\mathcal{X})^n} \mathcal{J}_r\bigl(w^{(k)}, \dot x^{(k)}, w^{(k)}, \dot x'\bigr)$.
6: Find maximal step size $t^{(k)} \in \{\beta^l \mid l \in \mathbb{N}\}$ with
$$\bar{\mathcal{J}}_r\bigl(w^{(k)}, \dot x^{(k)}\bigr) - \bar{\mathcal{J}}_r\bigl(w^{(k)} + t^{(k)} d_{-1}^{(k)},\ \dot x^{(k)} + t^{(k)} d_{+1}^{(k)}\bigr) \ge \sigma\, t^{(k)}\left(\bigl\|d_{-1}^{(k)}\bigr\|_2^2 + \bigl\|d_{+1}^{(k)}\bigr\|_2^2\right).$$
7: Set $w^{(k+1)} := w^{(k)} + t^{(k)} d_{-1}^{(k)}$.
8: Set $\dot x^{(k+1)} := \dot x^{(k)} + t^{(k)} d_{+1}^{(k)}$.
9: Set $k := k + 1$.
10: until $\bigl\|w^{(k)} - w^{(k-1)}\bigr\|_2^2 + \bigl\|\dot x^{(k)} - \dot x^{(k-1)}\bigr\|_2^2 \le \epsilon$.

The convergence properties of Algorithm 1 are discussed by von Heusinger and Kanzow (2009), so we skip the details here.
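Step 6 of Algorithm 1 is a backtracking search over the geometric grid $\{\beta^l \mid l \in \mathbb{N}\}$. The sketch below shows this step in isolation; the objective $\bar{\mathcal{J}}_r$ and the descent directions are passed in as callables and arrays, and the iteration cap is an assumption of the illustration.

def backtracking_step(theta_bar, w, x_dot, d_w, d_x, sigma=0.5, beta=0.5, max_iter=30):
    """Largest t in {beta^l} satisfying the sufficient-decrease test of Algorithm 1, line 6.

    theta_bar : callable evaluating the Nikaido-Isoda based objective at (w, x_dot)
    d_w, d_x  : numpy arrays holding the directions d_{-1}, d_{+1}
    """
    base = theta_bar(w, x_dot)
    decrease_per_t = sigma * (float((d_w ** 2).sum()) + float((d_x ** 2).sum()))
    t = 1.0
    for _ in range(max_iter):
        if base - theta_bar(w + t * d_w, x_dot + t * d_x) >= t * decrease_per_t:
            return t
        t *= beta
    return t  # fall back to the smallest step size tried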
4.2 A Modified Extragradient Approach

In Algorithm 1, lines 4 and 5, as well as the linesearch in line 6, require solving a concave maximization problem within each iteration. As this may become computationally demanding, we derive a second approach based on extragradient descent. Instead of reformulating the equilibrium problem into a minimax problem, we directly address the first-order optimality conditions of each player's minimization problem in (4): Under Assumption 2, a combination of actions $(w^*, \dot x^*)$ with $\dot x^* = \left[\dot x_1^{*T}, \dots, \dot x_n^{*T}\right]^T$ satisfies each player's first-order optimality conditions if, and only if, for all $(w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n$ the following inequalities hold,

$$\nabla_{w}\hat\theta_{-1}(w^*,\dot x^*)^T (w - w^*) \ge 0,$$
$$\nabla_{\dot x_i}\hat\theta_{+1}(w^*,\dot x^*)^T (\dot x_i - \dot x_i^*) \ge 0 \quad \forall\, i = 1,\dots,n.$$

As the joint action space of all players $\mathcal{W}\times\phi(\mathcal{X})^n$ is precisely the full Cartesian product of the learner's action set $\mathcal{W}$ and the $n$ data generators' action sets $\phi(\mathcal{X})$, the (weighted) sum of those individual optimality conditions is also a sufficient and necessary optimality condition for the equilibrium problem. Hence, a Nash equilibrium $(w^*,\dot x^*) \in \mathcal{W}\times\phi(\mathcal{X})^n$ is a solution of the variational inequality problem,

$$g_r(w^*,\dot x^*)^T \begin{bmatrix} w - w^* \\ \dot x - \dot x^* \end{bmatrix} \ge 0 \quad \forall\, (w,\dot x) \in \mathcal{W}\times\phi(\mathcal{X})^n, \qquad (21)$$

and vice versa (cf. Proposition 7.1 of Harker and Pang, 1990). The pseudo-gradient $g_r$ in (21) is defined as in (5) with fixed vector $r = [r_0, r_1, \dots, r_n]^T$ where $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\dots,n$ (cf. Equation 8). Under Assumption 3, this choice of $r$ ensures that the mapping $g_r(w,\dot x)$ is continuous and strictly monotone (cf. proof of Lemma 2 and Theorem 8). Hence, the variational inequality problem in (21) can be solved by modified extragradient descent (see, for instance, Chapter 7.2.3 of Geiger and Kanzow, 1999).

Before presenting Algorithm 2, which is an extragradient-based algorithm for the Nash prediction game, let us denote the $L_2$-projection of $a$ onto the non-empty, compact, and convex set $A$ by

$$\Pi_A(a) := \arg\min_{a' \in A} \|a - a'\|_2^2.$$

Notice that if $A := \{a \in \mathbb{R}^m \mid \|a\|_2 \le \kappa\}$ is the closed $l_2$-ball of radius $\kappa > 0$ and $a \notin A$, this projection simply reduces to a rescaling of vector $a$ to length $\kappa$. Based on this definition of $\Pi_A$, we can now state an iterative method (Algorithm 2), which, apart from back-projection steps, does not require solving an optimization problem in each iteration. The proposed algorithm converges to a solution of the variational inequality problem in (21), that is, the unique equilibrium of the Nash prediction game, if Assumptions 2 and 3 hold (cf. Theorem 7.40 of Geiger and Kanzow, 1999).
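When both action spaces are closed $l_2$-balls, the back-projection $\Pi_{\mathcal{W}\times\phi(\mathcal{X})^n}$ used in Algorithm 2 below therefore amounts to rescaling each component onto its ball. A minimal sketch under that assumption (the radii are parameters of the illustration):

import numpy as np

def project_l2_ball(a, kappa):
    # Pi_A(a) for A = {a : ||a||_2 <= kappa}: rescale to length kappa if outside
    norm = np.linalg.norm(a)
    return a if norm <= kappa else (kappa / norm) * a

def project_joint(w, x_dot, kappa_w, kappa_x):
    # projection onto W x phi(X)^n when W and each phi(X) are closed l2-balls:
    # the learner's action and each transformed instance are projected separately
    w_proj = project_l2_ball(w, kappa_w)
    x_proj = np.stack([project_l2_ball(row, kappa_x) for row in x_dot])
    return w_proj, x_proj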
Algorithm 2 EDS: Extragradient Descent Solver for Nash Prediction Games
Require: Cost functions $\hat\theta_v$ as defined in (1) and (2), and action spaces $\mathcal W$ and $\phi(\mathcal X)^n$.
1: Select initial $w^{(0)}\in\mathcal W$, set $\dot x^{(0)}:=x$, set $k:=0$, and select $\sigma\in(0,1)$ and $\beta\in(0,1)$.
2: Set $r_0:=1$ and $r_i:=\frac{c_{-1,i}}{c_{+1,i}}$ for all $i=1,\dots,n$.
3: repeat
4: Set $\begin{bmatrix}d_{-1}^{(k)}\\ d_{+1}^{(k)}\end{bmatrix} := \Pi_{\mathcal W\times\phi(\mathcal X)^n}\left(\begin{bmatrix}w^{(k)}\\ \dot x^{(k)}\end{bmatrix}-g_r\big(w^{(k)},\dot x^{(k)}\big)\right)-\begin{bmatrix}w^{(k)}\\ \dot x^{(k)}\end{bmatrix}$.
5: Find the maximal step size $t^{(k)}\in\{\beta^l\mid l\in\mathbb N\}$ with
$$-\,g_r\big(w^{(k)}+t^{(k)}d_{-1}^{(k)},\,\dot x^{(k)}+t^{(k)}d_{+1}^{(k)}\big)^{\mathsf T}\begin{bmatrix}d_{-1}^{(k)}\\ d_{+1}^{(k)}\end{bmatrix} \ge \sigma\left(\big\|d_{-1}^{(k)}\big\|_2^2+\big\|d_{+1}^{(k)}\big\|_2^2\right).$$
6: Set $\begin{bmatrix}\bar w^{(k)}\\ \bar x^{(k)}\end{bmatrix} := \begin{bmatrix}w^{(k)}\\ \dot x^{(k)}\end{bmatrix}+t^{(k)}\begin{bmatrix}d_{-1}^{(k)}\\ d_{+1}^{(k)}\end{bmatrix}$.
7: Set the step size of the extragradient $\gamma^{(k)} := -\,t^{(k)}\,\dfrac{g_r\big(\bar w^{(k)},\bar x^{(k)}\big)^{\mathsf T}\begin{bmatrix}d_{-1}^{(k)}\\ d_{+1}^{(k)}\end{bmatrix}}{\big\|g_r\big(\bar w^{(k)},\bar x^{(k)}\big)\big\|_2^2}$.
8: Set $\begin{bmatrix}w^{(k+1)}\\ \dot x^{(k+1)}\end{bmatrix} := \Pi_{\mathcal W\times\phi(\mathcal X)^n}\left(\begin{bmatrix}w^{(k)}\\ \dot x^{(k)}\end{bmatrix}-\gamma^{(k)}g_r\big(\bar w^{(k)},\bar x^{(k)}\big)\right)$.
9: Set $k:=k+1$.
10: until $\big\|w^{(k)}-w^{(k-1)}\big\|_2^2+\big\|\dot x^{(k)}-\dot x^{(k-1)}\big\|_2^2\le\varepsilon$.

5. Instances of the Nash Prediction Game

In this section, we present two instances of the Nash prediction game and investigate under which conditions those games possess unique Nash equilibria. We start by specifying both players' loss functions and regularizers. An obvious choice for the loss function of the learner, $\ell_{-1}(z,y)$, is the zero-one loss defined by

$$\ell_{0/1}(z,y) := \begin{cases}1 & \text{if } yz<0\\ 0 & \text{if } yz\ge 0\end{cases}.$$

A possible choice for the data generator's loss is $\ell_{0/1}(z,-1)$, which penalizes positive decision values $z$ independently of the class label. The rationale behind this choice is that the data generator experiences costs when the learner blocks an event, that is, assigns an instance to the positive class. For instance, a legitimate email sender experiences costs when a legitimate email is erroneously blocked, just like an abusive sender, who is also amalgamated into the data generator, experiences costs when spam messages are blocked. However, the zero-one loss violates Assumption 2 as it is neither convex nor twice continuously differentiable. In the following sections, we therefore approximate the zero-one loss by the logistic loss and a newly derived trigonometric loss, which both satisfy Assumption 2.

Recall that $\hat\Omega_{+1}(D,\dot D)$ is an estimate of the transformation costs that the data generator incurs when transforming the distribution that generates the instances $x_i$ at training time into the distribution that generates the instances $\dot x_i$ at application time. In our analysis, we approximate these costs by the average squared $l_2$-distance between $x_i$ and $\dot x_i$ in the feature space induced by the mapping $\phi$, that is,

$$\hat\Omega_{+1}(D,\dot D) := \frac1n\sum_{i=1}^n\frac12\big\|\phi(\dot x_i)-\phi(x_i)\big\|_2^2. \qquad (22)$$

The learner's regularizer $\hat\Omega_{-1}(w)$ penalizes the complexity of the predictive model $h(x)=\operatorname{sign} f_w(x)$. We consider Tikhonov regularization, which, for linear decision functions $f_w$, reduces to the squared $l_2$-norm of $w$,

$$\hat\Omega_{-1}(w) := \frac12\|w\|_2^2. \qquad (23)$$

Before presenting the Nash logistic regression (NLR) and the Nash support vector machine (NSVM), we turn to a discussion on the applicability of general kernel functions.
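As a concrete illustration of the regularizers in Equations 22 and 23, here is a minimal sketch of our own; it assumes the mapped original and transformed instances are available as the rows of NumPy arrays.

```python
import numpy as np

def transformation_cost(Phi_X, Phi_X_dot):
    """Data generator's regularizer of Eq. 22: the average squared
    l2-distance between original and transformed feature vectors.
    Phi_X, Phi_X_dot: (n, m) arrays whose rows are phi(x_i) and phi(xdot_i)."""
    n = Phi_X.shape[0]
    return (0.5 / n) * float(np.sum((Phi_X_dot - Phi_X) ** 2))

def tikhonov_regularizer(w):
    """Learner's regularizer of Eq. 23: half the squared l2-norm of w."""
    return 0.5 * float(np.dot(w, w))
```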
5.1 Applying Kernels

So far, we have assumed knowledge of a feature mapping $\phi:\mathcal X\to\phi(\mathcal X)$ such that we can compute an explicit feature representation $\phi(x_i)$ of the training instances $x_i$ for all $i=1,\dots,n$. However, in some applications such a feature mapping is unwieldy or hard to identify. Instead, one is often equipped with a kernel function $k:\mathcal X\times\mathcal X\to\mathbb R$ which measures the similarity between two instances. Generally, the kernel function $k$ is assumed to be positive semidefinite such that it can be stated in terms of a scalar product in the corresponding reproducing kernel Hilbert space; that is, there exists $\phi$ with $k(x,x')=\phi(x)^{\mathsf T}\phi(x')$.

To apply the representer theorem (see, e.g., Schölkopf et al., 2001), we assume that the transformed instances lie in the span of the mapped training instances; that is, we restrict the data generator's action space such that the transformed instances $\dot x_i$ are mapped into the same subspace of the reproducing kernel Hilbert space as the unmodified training instances $x_i$. By this assumption, the weight vector $w\in\mathcal W$ and the transformed instances $\phi(\dot x_i)\in\phi(\mathcal X)$ for $i=1,\dots,n$ can be expressed as linear combinations of the mapped training instances; that is, there exist $\alpha_i$ and $\Xi_{ij}$ such that

$$w=\sum_{i=1}^n\alpha_i\,\phi(x_i) \qquad\text{and}\qquad \phi(\dot x_j)=\sum_{i=1}^n\Xi_{ij}\,\phi(x_i)\quad\forall\,j=1,\dots,n.$$

Further, let us assume that the action spaces $\mathcal W$ and $\phi(\mathcal X)^n$ can be adequately translated into dual action spaces $\mathcal A\subset\mathbb R^n$ and $\mathcal Z\subset\mathbb R^{n\times n}$, which is possible, for instance, if $\mathcal W$ and $\phi(\mathcal X)^n$ are closed $l_2$-balls. Then, a kernelized variant of the Nash prediction game is obtained by inserting the above equations into the players' cost functions in (1) and (2) with regularizers in (22) and (23),

$$\hat\theta_{-1}(\alpha,\Xi) = \sum_{i=1}^n c_{-1,i}\,\ell_{-1}\big(\alpha^{\mathsf T}K\Xi e_i,\,y_i\big) + \rho_{-1}\,\frac12\,\alpha^{\mathsf T}K\alpha, \qquad (24)$$

$$\hat\theta_{+1}(\alpha,\Xi) = \sum_{i=1}^n c_{+1,i}\,\ell_{+1}\big(\alpha^{\mathsf T}K\Xi e_i,\,y_i\big) + \rho_{+1}\,\frac1{2n}\operatorname{tr}\big((\Xi-I_n)^{\mathsf T}K(\Xi-I_n)\big), \qquad (25)$$

where $e_i\in\{0,1\}^n$ is the $i$-th unit vector, $\alpha\in\mathcal A$ is the dual weight vector, $\Xi\in\mathcal Z$ is the dual transformed data matrix, and $K\in\mathbb R^{n\times n}$ is the kernel matrix with $K_{ij}:=k(x_i,x_j)$. In the dual Nash prediction game with cost functions (24) and (25), the learner chooses the dual weight vector $\alpha=[\alpha_1,\dots,\alpha_n]^{\mathsf T}$ and classifies a new instance $x$ by $h(x)=\operatorname{sign} f_\alpha(x)$ with $f_\alpha(x)=\sum_{i=1}^n\alpha_i\,k(x_i,x)$. In contrast, the data generator chooses the dual transformed data matrix $\Xi$, which implicitly reflects the change of the training distribution. The transformation costs are in proportion to the deviation of $\Xi$ from the identity matrix $I_n$; if $\Xi$ equals $I_n$, the learner's task reduces to standard kernelized empirical risk minimization. The proposed Algorithms 1 and 2 can be readily applied when replacing $w$ by $\alpha$ and $\dot x_i$ by $\Xi e_i$ for all $i=1,\dots,n$.

An alternative approach to a kernelization of the Nash prediction game is to first construct an explicit feature representation with respect to the given kernel function $k$ and the training instances, and then to train the Nash model by applying this feature mapping. Here, we again assume that the transformed instances $\phi(\dot x_i)$ as well as the weight vector $w$ lie in the span of the explicitly mapped training instances $\phi(x_i)$.
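Returning to the dual formulation for a moment, the cost functions (24) and (25) are straightforward to evaluate once the kernel matrix is available; the following minimal sketch (our own, with the loss functions passed in as callables) illustrates this.

```python
import numpy as np

def dual_costs(alpha, Xi, K, y, c_neg, c_pos, rho_neg, rho_pos,
               loss_neg, loss_pos):
    """Evaluate the dual cost functions of Eqs. 24 and 25.

    alpha: (n,) dual weight vector; Xi: (n, n) dual data matrix;
    K: (n, n) kernel matrix; y: (n,) labels in {-1, +1};
    c_neg, c_pos: (n,) instance-specific cost factors;
    loss_neg, loss_pos: callables mapping (decision value, label) to a loss."""
    n = K.shape[0]
    z = alpha @ K @ Xi                      # decision values alpha^T K Xi e_i
    theta_learner = sum(c_neg[i] * loss_neg(z[i], y[i]) for i in range(n)) \
        + 0.5 * rho_neg * float(alpha @ K @ alpha)
    D = Xi - np.eye(n)
    theta_generator = sum(c_pos[i] * loss_pos(z[i], y[i]) for i in range(n)) \
        + rho_pos / (2.0 * n) * float(np.trace(D.T @ K @ D))
    return theta_learner, theta_generator
```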
Let us consider the kernel PCA map (see, e.g., Schölkopf and Smola, 2002) defined by

$$\phi_{\mathrm{PCA}}: x\mapsto\Lambda^{\frac12+}V^{\mathsf T}\big[k(x_1,x),\dots,k(x_n,x)\big]^{\mathsf T}, \qquad (26)$$

where $V$ is the column matrix of eigenvectors of the kernel matrix $K$, $\Lambda$ is the diagonal matrix with the corresponding eigenvalues such that $K=V\Lambda V^{\mathsf T}$, and $\Lambda^{\frac12+}$ denotes the pseudo-inverse of the square root of $\Lambda$ with $\Lambda=\Lambda^{\frac12}\Lambda^{\frac12}$.

Remark 9 Notice that for any positive-semidefinite kernel function $k:\mathcal X\times\mathcal X\to\mathbb R$ and fixed training instances $x_1,\dots,x_n\in\mathcal X$, the PCA map is a uniquely defined real function $\phi_{\mathrm{PCA}}:\mathcal X\to\mathbb R^n$ such that $k(x_i,x_j)=\phi_{\mathrm{PCA}}(x_i)^{\mathsf T}\phi_{\mathrm{PCA}}(x_j)$ for any $i,j\in\{1,\dots,n\}$: We first show that $\phi_{\mathrm{PCA}}$ is a real mapping from the input space $\mathcal X$ to the Euclidean space $\mathbb R^n$. As $x\mapsto[k(x_1,x),\dots,k(x_n,x)]^{\mathsf T}$ is a real vector-valued function and $V$ is a real $n\times n$ matrix, it remains to show that the pseudo-inverse of $\Lambda^{\frac12}$ is real as well. Since the kernel function is positive semidefinite, all eigenvalues $\lambda_i$ of $K$ are non-negative, and hence, $\Lambda^{\frac12}$ is a diagonal matrix with real diagonal entries $\sqrt{\lambda_i}$ for $i=1,\dots,n$. The pseudo-inverse of this matrix is the uniquely defined diagonal matrix $\Lambda^{\frac12+}$ with real non-negative diagonal entries $\frac1{\sqrt{\lambda_i}}$ if $\lambda_i>0$ and zero otherwise. This proves the first claim. The PCA map also satisfies $k(x_i,x_j)=\phi_{\mathrm{PCA}}(x_i)^{\mathsf T}\phi_{\mathrm{PCA}}(x_j)$ for any pair of training instances $x_i$ and $x_j$ as

$$\phi_{\mathrm{PCA}}(x_i)=\Lambda^{\frac12+}V^{\mathsf T}\big[k(x_1,x_i),\dots,k(x_n,x_i)\big]^{\mathsf T}=\Lambda^{\frac12+}V^{\mathsf T}Ke_i=\Lambda^{\frac12+}V^{\mathsf T}V\Lambda V^{\mathsf T}e_i=\Lambda^{\frac12+}\Lambda V^{\mathsf T}e_i$$

for all $i=1,\dots,n$ and consequently

$$\phi_{\mathrm{PCA}}(x_i)^{\mathsf T}\phi_{\mathrm{PCA}}(x_j)=e_i^{\mathsf T}V\Lambda\Lambda^{\frac12+}\Lambda^{\frac12+}\Lambda V^{\mathsf T}e_j=e_i^{\mathsf T}V\Lambda\Lambda^{+}\Lambda V^{\mathsf T}e_j=e_i^{\mathsf T}V\Lambda V^{\mathsf T}e_j=e_i^{\mathsf T}Ke_j=K_{ij}=k(x_i,x_j),$$

which proves the second claim.

An equilibrium strategy pair $w^*\in\mathcal W$ and $[\phi_{\mathrm{PCA}}(\dot x_1^*)^{\mathsf T},\dots,\phi_{\mathrm{PCA}}(\dot x_n^*)^{\mathsf T}]^{\mathsf T}\in\phi(\mathcal X)^n$ can be identified by applying the PCA map together with Algorithm 1 or 2. To classify a new instance $x\in\mathcal X$, we may first map $x$ into the PCA map-induced feature space and apply the linear classifier $h(x)=\operatorname{sign} f_{w^*}(x)$ with $f_{w^*}(x)=w^{*\mathsf T}\phi_{\mathrm{PCA}}(x)$. Alternatively, we can derive a dual representation of $w^*$ such that $w^*=\sum_{i=1}^n\alpha_i^*\phi_{\mathrm{PCA}}(x_i)$, and consequently $f_{w^*}(x)=f_{\alpha^*}(x)=\sum_{i=1}^n\alpha_i^*\,k(x_i,x)$, where $\alpha^*=[\alpha_1^*,\dots,\alpha_n^*]^{\mathsf T}$ is a not necessarily uniquely defined dual weight vector of $w^*$. We therefore have to identify a solution $\alpha^*$ of the linear system

$$w^*=\Lambda^{\frac12+}V^{\mathsf T}K\alpha^*. \qquad (27)$$

A direct calculation shows that

$$\alpha^*:=V\Lambda^{\frac12+}w^* \qquad (28)$$

is a solution of (27) provided that either all elements $\lambda_i$ of the diagonal matrix $\Lambda$ are positive or that $\lambda_i=0$ implies that the same component of the vector $w^*$ is also equal to zero (in which case the solution is non-unique). In fact, inserting (28) in (27) then gives

$$\Lambda^{\frac12+}V^{\mathsf T}K\alpha^*=\Lambda^{\frac12+}V^{\mathsf T}V\Lambda V^{\mathsf T}V\Lambda^{\frac12+}w^*=\Lambda^{\frac12+}\Lambda^{\frac12}\,\Lambda^{\frac12}\Lambda^{\frac12+}w^*=w^*,$$

whereas in the other cases the linear system (27) is obviously inconsistent. The advantage of the latter approach is that classifying a new instance $x\in\mathcal X$ requires the computation of the scalar product $\sum_{i=1}^n\alpha_i^*\,k(x_i,x)$ rather than a matrix multiplication when mapping $x$ into the PCA map-induced feature space (cf. Equation 26).
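The PCA map in Equation 26 can be computed directly from an eigendecomposition of the kernel matrix; the following minimal sketch (our own illustration in Python) builds $\phi_{\mathrm{PCA}}$ for a given kernel function and training sample.

```python
import numpy as np

def make_pca_map(kernel, X_train):
    """Return a function implementing the kernel PCA map of Eq. 26.

    kernel: callable k(x, x') -> float, assumed positive semidefinite;
    X_train: sequence of n training instances x_1, ..., x_n."""
    n = len(X_train)
    K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
    lam, V = np.linalg.eigh(K)            # K = V diag(lam) V^T
    lam = np.clip(lam, 0.0, None)         # guard against tiny negative eigenvalues
    # diagonal of the pseudo-inverse of the square root of Lambda
    sqrt_pinv = np.where(lam > 1e-12, 1.0 / np.sqrt(np.maximum(lam, 1e-12)), 0.0)

    def phi_pca(x):
        kx = np.array([kernel(xi, x) for xi in X_train])
        return sqrt_pinv * (V.T @ kx)     # Lambda^{1/2+} V^T [k(x_1,x),...,k(x_n,x)]^T
    return phi_pca
```

Training instances mapped in this way satisfy $\phi_{\mathrm{PCA}}(x_i)^{\mathsf T}\phi_{\mathrm{PCA}}(x_j)=k(x_i,x_j)$ up to numerical precision, as shown in Remark 9.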
When implementing a kernelized solution, the data generator has to generate instances in the input space with dual representations $K\Xi^*e_1,\dots,K\Xi^*e_n$ and $\phi_{\mathrm{PCA}}(\dot x_1^*),\dots,\phi_{\mathrm{PCA}}(\dot x_n^*)$, respectively. To this end, the data generator must solve a pre-image problem, which typically has a non-unique solution. However, as every solution of this problem incurs the same costs for both players, the data generator is free to select any of them. To find such a solution, the data generator may solve a non-convex optimization problem as proposed by Mika et al. (1999), or may apply a non-iterative method (Kwok and Tsang, 2003) based on multidimensional scaling.

5.2 Nash Logistic Regression

In this section we study the particular instance of the Nash prediction game where each player's loss function rests on the negative logarithm of the logistic function $\sigma(a):=\frac1{1+e^{-a}}$, that is, the logistic loss

$$\ell_l(z,y) := -\log\sigma(yz) = \log\big(1+e^{-yz}\big). \qquad (29)$$

We consider the regularizers in (22) and (23), respectively, which give rise to the following definition of the Nash logistic regression (NLR). In the following definition, the column vectors $x:=[x_1^{\mathsf T},\dots,x_n^{\mathsf T}]^{\mathsf T}$ and $\dot x:=[\dot x_1^{\mathsf T},\dots,\dot x_n^{\mathsf T}]^{\mathsf T}$ again denote the concatenation of the original and the transformed training instances, respectively, which are mapped into the feature space by $x_i:=\phi(x_i)$ and $\dot x_i:=\phi(\dot x_i)$.

Definition 10 The Nash logistic regression (NLR) is an instance of the Nash prediction game with non-empty, compact, and convex action spaces $\mathcal W\subset\mathbb R^m$ and $\phi(\mathcal X)^n\subset\mathbb R^{m\cdot n}$ and cost functions

$$\hat\theta_{-1}^l(w,\dot x) := \sum_{i=1}^n c_{-1,i}\,\ell_l\big(w^{\mathsf T}\dot x_i,y_i\big)+\rho_{-1}\frac12\|w\|_2^2,$$
$$\hat\theta_{+1}^l(w,\dot x) := \sum_{i=1}^n c_{+1,i}\,\ell_l\big(w^{\mathsf T}\dot x_i,-1\big)+\rho_{+1}\frac1n\sum_{i=1}^n\frac12\|\dot x_i-x_i\|_2^2,$$

where $\ell_l$ is specified in (29).

As in our introductory discussion, the data generator's loss function $\ell_{+1}(z,y):=\ell_l(z,-1)$ penalizes positive decision values independently of the class label $y$. In contrast, instances that pass the classifier, that is, instances with negative decision values, incur little or almost no costs. By the above definition, the Nash logistic regression obviously satisfies Assumption 2 and, according to the following corollary, also satisfies Assumption 3 for suitable regularization parameters.

Corollary 11 Let the Nash logistic regression be specified as in Definition 10 with positive regularization parameters $\rho_{-1}$ and $\rho_{+1}$ which satisfy

$$\rho_{-1}\rho_{+1}\ge n\,c_{-1}^{\mathsf T}c_{+1}; \qquad (30)$$

then Assumptions 2 and 3 hold, and consequently, the Nash logistic regression possesses a unique Nash equilibrium.

Proof. By Definition 10, both players employ the logistic loss with $\ell_{-1}(z,y):=\ell_l(z,y)$ and $\ell_{+1}(z,y):=\ell_l(z,-1)$ and the regularizers in (22) and (23), respectively. Let

$$\ell_{-1}'(z,y)=\frac{-y}{1+e^{yz}}, \quad \ell_{-1}''(z,y)=\frac1{1+e^{z}}\cdot\frac1{1+e^{-z}}, \quad \ell_{+1}'(z,y)=\frac1{1+e^{-z}}, \quad \ell_{+1}''(z,y)=\frac1{1+e^{z}}\cdot\frac1{1+e^{-z}} \qquad (31)$$

denote the first and second derivatives of the players' loss functions with respect to $z\in\mathbb R$. Further, let

$$\nabla_w\hat\Omega_{-1}(w)=w, \quad \nabla_{w,w}^2\hat\Omega_{-1}(w)=I_m, \quad \nabla_{\dot x}\hat\Omega_{+1}(x,\dot x)=\frac1n(\dot x-x), \quad \nabla_{\dot x,\dot x}^2\hat\Omega_{+1}(x,\dot x)=\frac1n I_{m\cdot n}$$

denote the gradients and Hessians of the players' regularizers.
Assumption 2 holds as:

1. The second derivatives of $\ell_{-1}(z,y)$ and $\ell_{+1}(z,y)$ are positive and continuous for all $z\in\mathbb R$ and $y\in\mathcal Y$. Consequently, $\ell_v(z,y)$ is convex and twice continuously differentiable with respect to $z$ for $v\in\{-1,+1\}$ and fixed $y$.

2. The Hessians of the players' regularizers are fixed, positive definite matrices, and consequently both regularizers are twice continuously differentiable and uniformly strongly convex in $w\in\mathcal W$ and $\dot x\in\phi(\mathcal X)^n$ (for any fixed $x\in\phi(\mathcal X)^n$), respectively.

3. By Definition 10, the players' action sets are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces.

Assumption 3 holds as, for all $z\in\mathbb R$ and $y\in\mathcal Y$:

1. The second derivatives of $\ell_{-1}(z,y)$ and $\ell_{+1}(z,y)$ in (31) are equal.

2. The sum of the first derivatives of the loss functions is bounded,

$$\ell_{-1}'(z,y)+\ell_{+1}'(z,y)=\frac{-y}{1+e^{yz}}+\frac1{1+e^{-z}}=\begin{cases}\dfrac{1-e^{-z}}{1+e^{-z}} & \text{if } y=+1\\[4pt] \dfrac{2}{1+e^{-z}} & \text{if } y=-1\end{cases}\;\in(-1,2),$$

which together with Equation 14 gives

$$\tau=\sup_{(x,y)\in\phi(\mathcal X)\times\mathcal Y}\frac12\Big|\ell_{-1}'\big(f_w(x),y\big)+\ell_{+1}'\big(f_w(x),y\big)\Big|<1.$$

The supremum $\tau$ is strictly less than 1 since $f_w(x)$ is finite for compact action sets $\mathcal W$ and $\phi(\mathcal X)^n$. The smallest eigenvalues of the players' regularizers are $\lambda_{-1}=1$ and $\lambda_{+1}=\frac1n$, such that the inequalities

$$\rho_{-1}\rho_{+1}\ge n\,c_{-1}^{\mathsf T}c_{+1}>\frac{\tau^2}{\lambda_{-1}\lambda_{+1}}\,c_{-1}^{\mathsf T}c_{+1}$$

hold.

3. The partial gradient $\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x)=\frac1n(\dot x_i-x_i)$ of the data generator's regularizer is independent of $\dot x_j$ for all $j\ne i$ and $i=1,\dots,n$.

As Assumptions 2 and 3 are satisfied, the existence of a unique Nash equilibrium follows immediately from Theorem 8.

Recall that the weighting factors $c_{v,i}$ are strictly positive with $\sum_{i=1}^n c_{v,i}=1$ for both players $v\in\{-1,+1\}$. In particular, it therefore follows that in the unweighted case, where $c_{v,i}=\frac1n$ for all $i=1,\dots,n$ and $v\in\{-1,+1\}$, a sufficient condition to ensure the existence of a unique Nash equilibrium is to set the learner's regularization parameter to $\rho_{-1}\ge\frac1{\rho_{+1}}$.

5.3 Nash Support Vector Machine

The Nash logistic regression tends to non-sparse solutions. This becomes particularly apparent if the Nash equilibrium $(w^*,\dot x^*)$ is an interior point of the joint action set $\mathcal W\times\phi(\mathcal X)^n$, in which case the (partial) gradients in (9) and (10) are zero at $(w^*,\dot x^*)$. For regularizer (23), this implies that $w^*$ is a linear combination of the transformed instances $\dot x_i$ where all weighting factors are non-zero, since the first derivative of the logistic loss as well as the cost factors $c_{-1,i}$ are non-zero for all $i=1,\dots,n$.

The support vector machine (SVM), which employs the hinge loss

$$\ell_h(z,y):=\max(0,1-yz)=\begin{cases}1-yz & \text{if } yz<1\\ 0 & \text{if } yz\ge 1\end{cases},$$

does not suffer from non-sparsity; however, the hinge loss obviously violates Assumption 2 as it is not twice continuously differentiable. We therefore propose a twice continuously differentiable loss function that we call the trigonometric loss, which satisfies Assumptions 2 and 3.
Definition 12 For any fixed smoothness factor $s>0$, the trigonometric loss is defined by

$$\ell_t(z,y):=\begin{cases}-yz & \text{if } yz<-s\\[2pt] \dfrac{s-yz}{2}-\dfrac{s}{\pi}\cos\!\left(\dfrac{\pi}{2s}yz\right) & \text{if } |yz|\le s\\[4pt] 0 & \text{if } yz>s\end{cases}. \qquad (32)$$

The trigonometric loss is similar to the hinge loss in that, except around the decision boundary, it penalizes misclassifications in proportion to the decision value $z\in\mathbb R$ and attains zero for correctly classified instances. Analogous to the once continuously differentiable Huber loss, where a polynomial is embedded into the hinge loss, the trigonometric loss combines the perceptron loss $\ell_p(z,y):=\max(0,-yz)$ with a trigonometric function. This trigonometric embedding yields a convex, twice continuously differentiable function.

Lemma 13 The trigonometric loss $\ell_t(z,y)$ is convex and twice continuously differentiable with respect to $z\in\mathbb R$ for any fixed $y\in\mathcal Y$.

Proof. Let

$$\ell_t'(z,y)=\begin{cases}-y & \text{if } yz<-s\\[2pt] -\dfrac{y}{2}+\dfrac{y}{2}\sin\!\left(\dfrac{\pi}{2s}yz\right) & \text{if } |yz|\le s\\[4pt] 0 & \text{if } yz>s\end{cases} \qquad\text{and}\qquad \ell_t''(z,y)=\begin{cases}0 & \text{if } yz<-s\\[2pt] \dfrac{\pi}{4s}\cos\!\left(\dfrac{\pi}{2s}yz\right) & \text{if } |yz|\le s\\[4pt] 0 & \text{if } yz>s\end{cases}$$

denote the first and second derivatives of $\ell_t(z,y)$, respectively, with respect to $z\in\mathbb R$. The trigonometric loss $\ell_t(z,y)$ is convex in $z\in\mathbb R$ (for any fixed $y\in\mathcal Y$) as the second derivative $\ell_t''(z,y)$ is strictly positive if $|z|=|yz|<s$ and zero otherwise. Moreover, since the second derivative is continuous,

$$\lim_{|z|\to s^-}\ell_t''(z,y)=\frac{\pi}{4s}\cos\!\left(\pm\frac{\pi}{2}\right)=0=\lim_{|z|\to s^+}\ell_t''(z,y),$$

the trigonometric loss is also twice continuously differentiable.

Because of the similarities of the loss functions, we call the Nash prediction game that is based upon the trigonometric loss the Nash support vector machine (NSVM), where we again consider the regularizers in (22) and (23).

Definition 14 The Nash support vector machine (NSVM) is an instance of the Nash prediction game with non-empty, compact, and convex action spaces $\mathcal W\subset\mathbb R^m$ and $\phi(\mathcal X)^n\subset\mathbb R^{m\cdot n}$ and cost functions

$$\hat\theta_{-1}^t(w,\dot x):=\sum_{i=1}^n c_{-1,i}\,\ell_t\big(w^{\mathsf T}\dot x_i,y_i\big)+\rho_{-1}\frac12\|w\|_2^2, \qquad (33)$$
$$\hat\theta_{+1}^t(w,\dot x):=\sum_{i=1}^n c_{+1,i}\,\ell_t\big(w^{\mathsf T}\dot x_i,-1\big)+\rho_{+1}\frac1n\sum_{i=1}^n\frac12\|\dot x_i-x_i\|_2^2,$$

where $\ell_t$ is specified in (32).

The following corollary states sufficient conditions under which the Nash support vector machine satisfies Assumptions 2 and 3, and consequently has a unique Nash equilibrium.

Corollary 15 Let the Nash support vector machine be specified as in Definition 14 with positive regularization parameters $\rho_{-1}$ and $\rho_{+1}$ which satisfy

$$\rho_{-1}\rho_{+1}>n\,c_{-1}^{\mathsf T}c_{+1}; \qquad (34)$$

then Assumptions 2 and 3 hold, and consequently, the Nash support vector machine has a unique Nash equilibrium.

Proof. By Definition 14, both players employ the trigonometric loss with $\ell_{-1}(z,y):=\ell_t(z,y)$ and $\ell_{+1}(z,y):=\ell_t(z,-1)$ and the regularizers in (22) and (23), respectively.

Assumption 2 holds:

1. According to Lemma 13, $\ell_t(z,y)$, and consequently $\ell_{-1}(z,y)$ and $\ell_{+1}(z,y)$, are convex and twice continuously differentiable with respect to $z\in\mathbb R$ (for any fixed $y\in\{-1,+1\}$).

2. The regularizers of the Nash support vector machine are equal to those of the Nash logistic regression and possess the same properties as in Corollary 11.

3. By Definition 14, the players' action sets are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces.

Assumption 3 holds:
1. The second derivatives of $\ell_{-1}(z,y)$ and $\ell_{+1}(z,y)$ are equal for all $z\in\mathbb R$ since

$$\ell_t''(z,y)=\begin{cases}\dfrac{\pi}{4s}\cos\!\left(\dfrac{\pi}{2s}z\right) & \text{if } |z|\le s\\[4pt] 0 & \text{if } |z|>s\end{cases}$$

does not depend on $y\in\mathcal Y$.

2. The sum of the first derivatives of the loss functions is bounded, as for $y=-1$,

$$\ell_{-1}'(z,-1)+\ell_{+1}'(z,-1)=2\,\ell_t'(z,-1)=\begin{cases}0 & \text{if } z<-s\\[2pt] 1+\sin\!\left(\dfrac{\pi}{2s}z\right) & \text{if } |z|\le s\\[4pt] 2 & \text{if } z>s\end{cases}\;\in[0,2],$$

and for $y=+1$,

$$\ell_{-1}'(z,+1)+\ell_{+1}'(z,+1)=\begin{cases}-1 & \text{if } z<-s\\[2pt] \sin\!\left(\dfrac{\pi}{2s}z\right) & \text{if } |z|\le s\\[4pt] 1 & \text{if } z>s\end{cases}\;\in[-1,1].$$

Together with Equation 14, it follows that

$$\tau=\sup_{(x,y)\in\phi(\mathcal X)\times\mathcal Y}\frac12\Big|\ell_{-1}'\big(f_w(x),y\big)+\ell_{+1}'\big(f_w(x),y\big)\Big|\le 1.$$

The smallest eigenvalues of the players' regularizers are $\lambda_{-1}=1$ and $\lambda_{+1}=\frac1n$, such that the inequalities

$$\rho_{-1}\rho_{+1}>n\,c_{-1}^{\mathsf T}c_{+1}\ge\frac{\tau^2}{\lambda_{-1}\lambda_{+1}}\,c_{-1}^{\mathsf T}c_{+1}$$

hold.

3. As for Nash logistic regression, the partial gradient $\nabla_{\dot x_i}\hat\Omega_{+1}(x,\dot x)=\frac1n(\dot x_i-x_i)$ of the data generator's regularizer is independent of $\dot x_j$ for all $j\ne i$ and $i=1,\dots,n$.

Because Assumptions 2 and 3 are satisfied, the existence of a unique Nash equilibrium follows immediately from Theorem 8.

6. Experimental Evaluation

The goal of this section is to explore the relative strengths and weaknesses of the discussed instances of the Nash prediction game and of existing reference methods in the context of email spam filtering. We compare regular logistic regression (LR), the support vector machine (SVM), the support vector machine with trigonometric loss (SVMT, a variant of the SVM which minimizes (33) for the given training data), the worst-case solution Invar-SVM (SVM for invariances with feature removal; Globerson and Roweis, 2006; Teo et al., 2007), and the Nash equilibrium strategies Nash logistic regression (NLR) and Nash support vector machine (NSVM).

data set       instances   features   delivery period
ESP            169,612     541,713    01/06/2007 - 27/04/2010
Mailinglist    128,117     266,378    01/04/1999 - 31/05/2006
Private        108,178     582,100    01/08/2005 - 31/03/2010
TREC 2007       75,496     214,839    04/08/2007 - 07/06/2007

Table 1: Data sets used in the experiments.

We use four corpora of chronologically sorted emails, detailed in Table 1. The first data set contains emails of an email service provider (ESP) collected between 2007 and 2010. The second (Mailinglist) is a collection of emails from publicly available mailing lists, augmented by spam emails from Bruce Guenter's spam trap of the same time period. The third corpus (Private) contains newsletters and spam and non-spam emails of the authors. The last corpus is the NIST TREC 2007 spam corpus.

The feature mapping $\phi(x)$ is defined such that email $x\in\mathcal X$ is tokenized with the X-tokenizer (Siefkes et al., 2004) and converted into the $m$-dimensional binary bag-of-words vector $x\in\{0,1\}^m$. The value of $m$ is determined by the number of distinct terms in the data set, where we have removed all terms which occur only once. For each experiment and each repetition, we then construct the PCA mapping (26) with respect to the corresponding $n$ training emails using the linear kernel $k(x,x'):=x^{\mathsf T}x'$, resulting in $n$-dimensional training instances $\phi_{\mathrm{PCA}}(x_i)\in\mathbb R^n$ for $i=1,\dots,n$.
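The bag-of-words construction is standard; for concreteness, here is a minimal sketch of our own, in which a plain whitespace tokenizer stands in for the X-tokenizer used in the paper (which is not reproduced here).

```python
import numpy as np

def binary_bag_of_words(emails):
    """Build binary bag-of-words vectors x in {0,1}^m for raw email strings.
    Terms that appear in only a single email are dropped, approximating the
    paper's removal of terms that occur only once."""
    doc_freq = {}
    tokenized = []
    for text in emails:
        tokens = set(text.split())        # stand-in tokenizer
        tokenized.append(tokens)
        for t in tokens:
            doc_freq[t] = doc_freq.get(t, 0) + 1
    vocab = sorted(t for t, c in doc_freq.items() if c > 1)
    index = {t: j for j, t in enumerate(vocab)}
    X = np.zeros((len(emails), len(vocab)))
    for i, tokens in enumerate(tokenized):
        for t in tokens:
            if t in index:
                X[i, index[t]] = 1.0
    return X, vocab
```

The resulting rows can then be fed into the linear-kernel PCA map sketched in Section 5.1.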
To ensure the convexity as well as the compactness requirement in Assumption 2, we notionally restrict the players' action sets by defining $\phi(\mathcal X):=\{\phi_{\mathrm{PCA}}(x)\in\mathbb R^n\mid\|\phi_{\mathrm{PCA}}(x)\|_2^2\le\kappa\}$ and $\mathcal W:=\{w\in\mathbb R^n\mid\|w\|_2^2\le\kappa\}$ for some fixed constant $\kappa$. Note that by choosing an arbitrarily large $\kappa$, the players' action sets become effectively unbounded. For both algorithms, ILS and EDS, we set $\sigma:=0.001$, $\beta:=0.2$, and $\varepsilon:=10^{-14}$. The algorithms are stopped if $l$ exceeds 30 in line 6 of ILS and line 5 of EDS, respectively; in this case, no convergence is achieved.

In all experiments, we use the F-measure, that is, the harmonic mean of precision and recall, as evaluation measure and tune all parameters with respect to likelihood. The particular protocol and results of each experiment are detailed in the following sections.

6.1 Convergence

Corollaries 11 (for Nash logistic regression) and 15 (for the Nash support vector machine) specify conditions on the regularization parameters $\rho_{-1}$ and $\rho_{+1}$ under which a unique Nash equilibrium necessarily exists. When this is the case, both the ILS and the EDS algorithm will converge on that Nash equilibrium. In the first set of experiments, we study whether repeated restarts of the algorithms converge on the same equilibrium when the bounds in Equations 30 and 34 are satisfied, and when they are violated to increasingly large degrees.

We set $c_{v,i}:=\frac1n$ for $v\in\{-1,+1\}$ and $i=1,\dots,n$, such that for $\rho_{-1}>\frac1{\rho_{+1}}$ both bounds (Equations 30 and 34) are satisfied. For each value of $\rho_{-1}$ and $\rho_{+1}$ and each of 10 repetitions, we randomly draw 400 emails from the data set and run EDS with randomly chosen initial solutions $(w^{(0)},\dot x^{(0)})$ until convergence. We run ILS on the same training set; in each repetition, we randomly choose a distinct initial solution, and after each iteration $k$ we compute the Euclidean distance between the EDS solution and the current ILS iterate $(w^{(k)},\dot x^{(k)})$.

Figure 1 reports on these average Euclidean distances between distinctly initialized runs. The blue curves ($\rho_{-1}=2/\rho_{+1}$) satisfy Equations 30 and 34, the yellow curves ($\rho_{-1}=1/\rho_{+1}$) lie exactly on the boundary; all other curves violate the bounds. Dotted lines show the Euclidean distance between the Nash equilibrium and the solution of logistic regression.

Our findings are as follows. Logistic regression and the regular SVM never coincide with the Nash equilibrium; the Euclidean distances lie in the range between $10^{-2}$ and $10^{2}$. ILS and EDS always converge to identical equilibria when (30) and (34) are satisfied (blue and yellow curves); here the Euclidean distances lie at the threshold of numerical computing accuracy. When Equations 30 and 34 are violated by a factor of up to 4 (turquoise and red curves), all repetitions still converge on the same equilibrium, indicating that the equilibrium is either still unique or that a secondary equilibrium is unlikely to be found.
When the bounds are violated by a factor of 8 or 16 (green and purple curves), then some repetitions of the learning algorithms do not converge or start to converge to distinct equilibria. In the latter case, learner and data generator may attain distinct equilibria and may experience an arbitrarily poor outcome when playing a Nash equilibrium.

(Figure 1 shows one panel per value of $\rho_{+1}$, plotting the distance to the Nash equilibrium against the number of iterations for several settings $\rho_{-1}=2^l/\rho_{+1}$.)
Figure 1: Average Euclidean distance between the EDS solution and the ILS solution at iteration $k=0,\dots,40$ for Nash logistic regression on the ESP corpus. The dotted lines show the distance between the EDS solution and the solution of logistic regression. Error bars indicate standard deviation.

6.2 Regularization Parameters

The regularization parameters $\rho_v$ of the players $v\in\{-1,+1\}$ play a major role in the prediction game. The learner's regularizer determines the generalization ability of the predictive model, and the data generator's regularizer controls the amount of change in the data generation process. In order to tune these parameters, one would need access to labeled data that are governed by the transformed input distribution. In our second experiment, we explore to which extent those parameters can be estimated using a portion of the newest training data. Intuitively, the most recent training data may be more similar to the test data than older training data.

(Figure 2 plots the F-measure of NLR on the hold-out (ho) and test (te) data; the right-hand panels show sections for fixed $\rho_{+1}\in\{4,8,1024\}$ and fixed $\rho_{-1}\in\{0.0001,0.002,0.125\}$.)
Figure 2: Left: Performance of NLR on the hold-out and the test data with respect to the regularization parameters. Right: Performance of NLR on the hold-out data (ho) and the test data (te) for fixed values of $\rho_v$.

We split the data set into three parts: the 2,000 oldest emails constitute the training portion, we use the next 2,000 emails as hold-out portion on which the parameters are tuned, and the remaining emails are used as test set. We randomly draw 200 spam and 200 non-spam messages from the training portion and draw another subset of 400 emails from the hold-out portion. Both NPG instances are trained on the 400 training emails and evaluated against all emails of the test portion. To tune the parameters, we conduct a grid search maximizing the likelihood on the 400 hold-out emails.
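The grid search itself is routine; a minimal sketch of our own follows, in which `train_npg` and `holdout_log_likelihood` are hypothetical helpers standing in for model fitting and hold-out evaluation, and the search ranges are assumptions rather than the values used in the paper.

```python
import itertools
import numpy as np

def tune_regularizers(train_data, holdout_data, train_npg, holdout_log_likelihood):
    """Grid search over (rho_-1, rho_+1) maximizing the hold-out likelihood."""
    rho_neg_grid = 2.0 ** np.arange(-10, 3)   # assumed range for rho_-1
    rho_pos_grid = 2.0 ** np.arange(-2, 11)   # assumed range for rho_+1
    best = (None, None, -np.inf)
    for rho_neg, rho_pos in itertools.product(rho_neg_grid, rho_pos_grid):
        model = train_npg(train_data, rho_neg, rho_pos)
        ll = holdout_log_likelihood(model, holdout_data)
        if ll > best[2]:
            best = (rho_neg, rho_pos, ll)
    return best[0], best[1]
```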
We repeat this experiment 10 times for all four data sets and report on the resulting parameters as well as on the "optimal" reference parameters according to the maximal value of F-measure on the test set. Those optimal regularization parameters are not used in later experiments. The intuition of the experiment is that the data generation process has already been changed between the oldest and the latest emails. This change may cause a distribution shift which is reflected in the hold-out portion. We expect that one can tune each player's regularization parameter by tuning with respect to this hold-out set.

In Figure 2 (left) we plot the performance of the Nash logistic regression (NLR) on the hold-out and the test data against the regularization parameters $\rho_{-1}$ and $\rho_{+1}$. The dashed line visualizes the bound in (30) on the regularization parameters for which NLR is guaranteed to possess a unique Nash equilibrium. Figure 2 (right) shows sectional views of the left plot along the $\rho_{-1}$-axis (upper diagram) and the $\rho_{+1}$-axis (lower diagram) for several values of $\rho_{+1}$ and $\rho_{-1}$, respectively. As expected, the effect of the regularization parameters on the test data is much stronger than on the hold-out data.

It turns out that the data generator's $\rho_{+1}$ has almost no impact on the value of F-measure on the hold-out data set (see the lower right diagram of Figure 2). Hence, we conclude that estimating $\rho_{+1}$ without access to labeled data from the test distribution or additional knowledge about the data generator is difficult for this application; the most recent training data are still too different from the test data. In all remaining experiments and for all data sets we set $\rho_{+1}=8$ for NLR and $\rho_{+1}=2$ for NSVM. For those choices the Nash models performed generally best on the hold-out set for a large variety of values of $\rho_{-1}$. For Invar-SVM, the regularization of the data generator's transformation is controlled explicitly by the number $K$ of modifiable attributes per positive instance. We conducted the same experiment for Invar-SVM, resulting in an optimal value of $K=25$; that is, the data generator is allowed to remove up to 25 tokens of each spam email of the training data set.

From the upper right diagram of Figure 2 we see that estimating $\rho_{-1}$ for any fixed $\rho_{+1}$ seems possible. Even if we slightly overestimate the learner's optimal regularization parameter (to compensate for the distributional difference between the transformed training sample and the marginally shifted hold-out set), the determined value of $\rho_{-1}$ is close to the optimum for all four data sets.

6.3 Evaluation for an Adversary Following an Equilibrium Strategy

We evaluate both a regular classifier trained under the i.i.d. assumption and a model that follows a Nash equilibrium strategy against both an adversary who does not transform the input distribution and an adversary who executes the Nash-equilibrial transformation on the input distribution. Since we cannot be certain that actual spam senders play a Nash equilibrium, we use the following semi-artificial setting.
The learner observes a sample of 200 spam and 200 non-spam emails drawn from the training portion of the data and estimates the Nash-optimal prediction model with parameters $\dot w$; the trivial baseline solution of regularized empirical risk minimization (ERM) is denoted by $w$. The data generator observes a distinct sample $D$ of 200 spam and 200 non-spam messages, also drawn from the training portion, and computes their Nash-optimal response $\dot D$. We again set $c_{v,i}:=\frac1n$ for $v\in\{-1,+1\}$ and $i=1,\dots,n$ and study the following four scenarios:

- $(w,D)$: Both players ignore the presence of an opponent; that is, the learner employs a regular classifier and the sender does not change the data generation process.
- $(w,\dot D)$: The learner ignores the presence of an active data generator who changes the data generation process such that $D$ evolves to $\dot D$ by playing a Nash strategy.
- $(\dot w,D)$: The learner expects a rational data generator and chooses a Nash-equilibrial prediction model. However, the data generator does not change the input distribution.
- $(\dot w,\dot D)$: Both players are aware of the opponent and play a Nash-equilibrial action to secure the lowest costs.

We repeat this experiment 100 times for all four data sets. Table 2 reports on the average values of F-measure over all repetitions for both NPG instances and the corresponding baselines; the significance of the differences between the F-measures of $f_w$ and $f_{\dot w}$ for fixed sample $D$ and $\dot D$, respectively, was assessed at $\alpha=0.05$.

NLR vs. LR       ESP                Mailinglist        Private            TREC 2007
                 $D$     $\dot D$   $D$     $\dot D$   $D$     $\dot D$   $D$     $\dot D$
$w$              0.957   0.912      0.955   0.928      0.961   0.903      0.961   0.932
$\dot w$         0.924   0.925      0.939   0.939      0.944   0.912      0.957   0.936

NSVM vs. SVM     ESP                Mailinglist        Private            TREC 2007
                 $D$     $\dot D$   $D$     $\dot D$   $D$     $\dot D$   $D$     $\dot D$
$w$              0.987   0.958      0.987   0.961      0.980   0.955      0.979   0.960
$\dot w$         0.984   0.976      0.985   0.976      0.979   0.961      0.981   0.968

Table 2: Nash predictor and regular classifier against passive and Nash-equilibrial data generator.

As expected, when the data generator does not alter the input distribution, the regularized empirical risk minimization baselines, logistic regression and the SVM, are generally best. However, the performance of those baselines drops substantially when the data generator plays the Nash-equilibrial action $\dot D$. The Nash-optimal prediction models are more robust against this transformation of the input distribution and significantly outperform the reference methods for all four data sets.

6.4 Case Study on Email Spam Filtering

To study the performance of the Nash prediction models and the baselines for email spam filtering, we evaluate all methods into the future by processing the test set in chronological order. The test portion of each data set is split into 20 chronologically sorted disjoint subsets. We average the value of F-measure on each of those subsets over the 20 models (trained on different samples drawn from the training portion) for each method and perform a paired t-test. In the absence of information about player- and instance-specific costs, we again set $c_{v,i}:=\frac1n$ for $v\in\{-1,+1\}$, $i=1,\dots,n$.
Note that the chosen loss functions and regularizers would allow us to select any positive cost factors without violating Assumption 1.

Figure 3 shows that, for all data sets, the NPG instances outperform logistic regression (LR), the SVM, and SVMT, which do not explicitly factor the adversary into the optimization criterion. Especially for the ESP corpus, the Nash logistic regression (NLR) and the Nash support vector machine (NSVM) are superior. On the TREC 2007 data set, the methods behave comparably, with a slight advantage for the Nash support vector machine. The period over which the TREC 2007 data have been collected is very short; we believe that the training and test instances are governed by nearly identical distributions. Consequently, for this data set, the game-theoretic models do not gain a significant advantage over logistic regression and the SVM that assume i.i.d. samples. With respect to the non-game-theoretic baselines, the regular SVM outperforms LR and SVMT for most of the data sets.

Table 3 shows aggregated results over all four data sets. For each point in each of the diagrams of Figure 3, we conduct a pairwise comparison of all methods based on a paired t-test at a confidence level of $\alpha=0.05$. When a difference is significant, we count this as a win for the method that achieves a higher value of F-measure. Each line of Table 3 details the wins and losses of one method against all other methods. The Nash logistic regression and the Nash support vector machine have more wins than losses against each of the other methods. The ranking continues with Invar-SVM, the regular SVM, logistic regression, and the trigonometric loss SVM, which loses more frequently than it wins against all other methods.

(Figure 3 consists of four panels, one per corpus, plotting the F-measure of SVM, LR, SVMT, Invar-SVM, NLR (ILS), and NSVM (ILS) over the chronologically ordered test subsets.)
Figure 3: Value of F-measure of the predictive models. Error bars indicate standard errors.

method vs. method   SVM     LR      SVMT    Invar-SVM   NLR     NSVM
SVM                 0:0     40:2    53:0    30:20       8:57    2:65
LR                  2:40    0:0     49:5    19:29       5:59    2:71
SVMT                0:53    5:49    0:0     9:47        2:70    2:74
Invar-SVM           20:30   29:19   47:9    0:0         5:57    3:57
NLR                 57:8    59:5    70:2    57:5        0:0     22:30
NSVM                65:2    71:2    74:2    57:3        30:22   0:0

Table 3: Results of the paired t-test over all corpora: number of trials in which each method (row) has significantly outperformed each other method (column) vs. the number of times it was outperformed.

6.5 Efficiency versus Effectiveness

To assess the predictive performance as well as the execution time as a function of the sample size, we train the baselines and the two NPG instances for a varying number of training examples.
We report on the results for the ESP data set in Figure 4.

(Figure 4 plots F-measure and execution time in seconds on the ESP corpus against the number of training emails, from 50 to 3,200, for SVM, LR, SVMT, Invar-SVM, NLR (ILS), NSVM (ILS), NLR (EDS), and NSVM (EDS).)
Figure 4: Predictive performance (left) and execution time (right) for varying sizes of the training data set.

The game-theoretic models significantly outperform the trivial baseline methods logistic regression, the SVM, and the SVMT, especially for small data sets. However, this comes at the price of considerably higher computational cost. The ILS algorithm generally requires only a couple of iterations to converge; however, in each iteration several optimization problems have to be solved, so that the total execution time is up to a factor of 150 larger than that of the corresponding ERM baseline. In contrast to the ILS algorithm, a single iteration of the EDS algorithm does not require solving nested optimization problems. However, the execution time of the EDS algorithm is still higher, as it often requires several thousand iterations to fully converge. For larger data sets, the discrepancy in predictive performance between the game-theoretic models and the i.i.d. baselines decreases. Our results do not provide conclusive evidence on whether ILS or EDS is faster at solving the optimization problems. We conclude that the benefit of the NPG prediction models over the classification baselines is greatest for small to medium sample sizes.

6.6 Nash-Equilibrial Transformation

In contrast to Invar-SVM, the Nash models allow the data generator to modify non-spam emails. However, in practice most senders of legitimate messages do not deliberately change their writing behavior in order to bypass spam filters, perhaps with the exception of senders of newsletters who must be careful not to trigger filtering mechanisms. In a final experiment, we want to study whether the Nash model reflects this aspect of reality, and how the data generator's regularizer affects this transformation. The training portion again contains $n_{+1}=200$ spam and $n_{-1}=200$ non-spam instances randomly chosen from the 4,000 oldest emails. We determine the Nash equilibrium and measure the number of additions and deletions to spam and non-spam emails in $\dot D$:

$$\Delta_{-1}^{\mathrm{add}}:=\frac1{n_{-1}}\sum_{i:y_i=-1}\sum_{j=1}^m\max\big(0,\dot x_{i,j}-x_{i,j}\big), \qquad \Delta_{+1}^{\mathrm{add}}:=\frac1{n_{+1}}\sum_{i:y_i=+1}\sum_{j=1}^m\max\big(0,\dot x_{i,j}-x_{i,j}\big),$$
$$\Delta_{-1}^{\mathrm{del}}:=\frac1{n_{-1}}\sum_{i:y_i=-1}\sum_{j=1}^m\max\big(0,x_{i,j}-\dot x_{i,j}\big), \qquad \Delta_{+1}^{\mathrm{del}}:=\frac1{n_{+1}}\sum_{i:y_i=+1}\sum_{j=1}^m\max\big(0,x_{i,j}-\dot x_{i,j}\big),$$

where $x_{i,j}$ indicates the presence of token $j$ in the $i$-th training email; that is, $\Delta_v^{\mathrm{add}}$ and $\Delta_v^{\mathrm{del}}$ denote the average number of word additions and deletions per spam and non-spam email performed by the sender. Figure 5 shows the number of additions and deletions of the Nash transformation as a function of the adversary's regularization parameter for the ESP data set.
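These four statistics can be computed directly from the term indicators; a minimal sketch of our own follows, assuming the original and transformed feature vectors are given as the rows of two arrays.

```python
import numpy as np

def transformation_statistics(X, X_dot, y):
    """Average number of word additions and deletions per spam (+1) and
    non-spam (-1) training email, i.e., the Delta quantities defined above.
    X, X_dot: (n, m) arrays of original/transformed term indicators;
    y: (n,) array of labels in {-1, +1}."""
    stats = {}
    for label in (-1, +1):
        mask = (y == label)
        added = np.maximum(0.0, X_dot[mask] - X[mask]).sum(axis=1)
        deleted = np.maximum(0.0, X[mask] - X_dot[mask]).sum(axis=1)
        stats[label] = {"add": float(added.mean()), "del": float(deleted.mean())}
    return stats
```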
Table 4 reports on the average number of word additions and deletions for all data sets. For Invar-SVM, we set the number of possible deletions to $K=25$.

                      ESP                     Mailinglist
                non-spam      spam        non-spam      spam
game model      add   del   add    del    add   del   add    del
Invar-SVM       0.0   0.0   0.0   24.8    0.0   0.0   0.0   23.9
NLR             0.7   1.0  22.5   31.2    0.3   0.4   8.6   10.9
NSVM            0.4   0.5  17.9   23.8    0.3   0.3   6.9    8.4

                      Private                 TREC 2007
                non-spam      spam        non-spam      spam
game model      add   del   add    del    add   del   add    del
Invar-SVM       0.0   0.0   0.0   24.2    0.0   0.0   0.0   24.7
NLR             0.4   0.2  24.3   11.2    0.2   0.2  15.0   11.4
NSVM            0.1   0.1  15.6    7.3    0.2   0.1  11.1    8.4

Table 4: Average number of word additions and deletions per training email.

The Nash-equilibrial transformation imposes almost no changes on any non-spam email; the number of modifications declines as the regularization parameter grows (see Figure 5). We observe for all data sets that, even if the total amount of transformation differs for NLR and NSVM, both instances behave similarly insofar as the number of word additions and deletions continues to grow when the adversary's regularizer decreases.

(Figure 5 plots the number of word modifications against $\rho_{+1}$ for NLR and NSVM, separately for additions and deletions on spam and non-spam emails.)
Figure 5: Average number of additions and deletions per spam/non-spam email for NLR (left) and NSVM (right) with respect to the adversary's regularization parameter $\rho_{+1}$ for fixed $\rho_{-1}=n^{-1}$.

7. Conclusion

We studied prediction games in which the learner and the data generator have conflicting but not necessarily directly antagonistic cost functions. We focused on static games in which learner and data generator have to commit simultaneously to a predictive model and a transformation of the input distribution, respectively. The cost-minimizing action of each player depends on the opponent's move; in the absence of information about the opponent's move, players may choose to play a Nash equilibrium strategy, which constitutes a cost-minimizing move for each player if the other player follows the equilibrium as well. Because a combination of actions from distinct equilibria may lead to arbitrarily high costs for either player, we have studied conditions under which a prediction game can be guaranteed to possess a unique Nash equilibrium. Lemma 1 identifies conditions under which at least one equilibrium exists, and Theorem 8 elaborates on when this equilibrium is unique. We propose an inexact linesearch approach and a modified extragradient approach for identifying this unique equilibrium. Empirically, both approaches perform quite similarly. We derived Nash logistic regression and Nash support vector machine models and kernelized versions of these methods.
Corollaries 11 and 15 specialize Theorem 8 and expound conditions on the players' regularization parameters under which the Nash logistic regression and the Nash support vector machine converge on a unique Nash equilibrium. Empirically, we find that both methods identify unique Nash equilibria when the bounds laid out in Corollaries 11 and 15 are satisfied or violated by a factor of up to 4. From our experiments on several email corpora we conclude that Nash logistic regression and the Nash support vector machine outperform their i.i.d. baselines and Invar-SVM for the problem of classifying future emails based on training data from the past.

Acknowledgments

This work was supported by the German Science Foundation (DFG) under grant SCHE 540/12-1 and by STRATO AG. We thank Niels Landwehr and Christoph Sawade for constructive comments and suggestions, and the anonymous reviewers for helpful contributions and careful proofreading of the manuscript.

References

Tamer Basar and Geert J. Olsder. Dynamic Noncooperative Game Theory. Society for Industrial and Applied Mathematics, 1999.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Michael Brückner and Tobias Scheffer. Nash equilibria of static prediction games. In Advances in Neural Information Processing Systems. MIT Press, 2009.

Michael Brückner and Tobias Scheffer. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Diego, CA, USA. ACM, 2011.

Ofer Dekel and Ohad Shamir. Learning to classify with missing and corrupted features. In Proceedings of the International Conference on Machine Learning. ACM, 2008.

Ofer Dekel, Ohad Shamir, and Lin Xiao. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149–178, 2010.

Carl Geiger and Christian Kanzow. Theorie und Numerik restringierter Optimierungsaufgaben. Springer, 1999.

Laurent El Ghaoui, Gert R. G. Lanckriet, and Georges Natsoulis. Robust classification with interval data. Technical Report UCB/CSD-03-1279, EECS Department, University of California, Berkeley, 2003.

Amir Globerson and Sam T. Roweis. Nightmare at test time: Robust learning by feature deletion. In Proceedings of the International Conference on Machine Learning. ACM, 2006.

Amir Globerson, Choon Hui Teo, Alex J. Smola, and Sam T. Roweis. An adversarial view of covariate shift and a minimax approach. In Dataset Shift in Machine Learning, pages 179–198. MIT Press, 2009.

Patrick T. Harker and Jong-Shi Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Mathematical Programming, 48(2):161–220, 1990.

Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1991.

James T. Kwok and Ivor W. Tsang. The pre-image problem in kernel methods. In Proceedings of the International Conference on Machine Learning, pages 408–415. AAAI Press, 2003.
Gert R. G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555–582, 2002.

Sebastian Mika, Bernhard Schölkopf, Alex J. Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems. MIT Press, 1999.

J. Ben Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica, 33(3):520–534, 1965.

Bernhard Schölkopf and Alex J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers, 2001.

Christian Siefkes, Fidelis Assis, Shalendra Chhabra, and William S. Yerazunis. Combining Winnow and orthogonal sparse bigrams for incremental spam filtering. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), volume 3202 of Lecture Notes in Artificial Intelligence, pages 410–421. Springer, 2004.

Choon Hui Teo, Amir Globerson, Sam T. Roweis, and Alex J. Smola. Convex learning with invariances. In Advances in Neural Information Processing Systems. MIT Press, 2007.

Anna von Heusinger and Christian Kanzow. Relaxation methods for generalized Nash equilibrium problems with inexact line search. Journal of Optimization Theory and Applications, 143(1):159–183, 2009.