Journal of Machine Learning Research 13 (2012) 2279-2292
Submitted 8/11; Revised 3/12; Published 8/12

Pairwise Support Vector Machines and their Application to Large Scale Problems

Carl Brunner (C.BRUNNER@GMX.NET)
Andreas Fischer (ANDREAS.FISCHER@TU-DRESDEN.DE)
Institute for Numerical Mathematics
Technische Universität Dresden
01062 Dresden, Germany

Klaus Luig (LUIG@COGNITEC.COM)
Thorsten Thies (THIES@COGNITEC.COM)
Cognitec Systems GmbH
Grossenhainer Str. 101
01127 Dresden, Germany

Editor: Corinna Cortes

©2012 Carl Brunner, Andreas Fischer, Klaus Luig and Thorsten Thies.

Abstract

Pairwise classification is the task to predict whether the examples $a, b$ of a pair $(a, b)$ belong to the same class or to different classes. In particular, interclass generalization problems can be treated in this way. In pairwise classification, the order of the two input examples should not affect the classification result. To achieve this, particular kernels as well as the use of symmetric training sets in the framework of support vector machines were suggested. The paper discusses both approaches in a general way and establishes a strong connection between them. In addition, an efficient implementation is discussed which allows the training of several millions of pairs. The value of these contributions is confirmed by excellent results on the labeled faces in the wild benchmark.

Keywords: pairwise support vector machines, interclass generalization, pairwise kernels, large scale problems

1. Introduction

To extend binary classifiers to multiclass classification several modifications have been suggested, for example the one against all technique, the one against one technique, or directed acyclic graphs, see Duan and Keerthi (2005), Hill and Doucet (2007), Hsu and Lin (2002), and Rifkin and Klautau (2004) for further information, discussions, and comparisons. A more recent approach used in the field of multiclass and binary classification is pairwise classification (Abernethy et al., 2009; Bar-Hillel et al., 2004a,b; Bar-Hillel and Weinshall, 2007; Ben-Hur and Noble, 2005; Phillips, 1999; Vert et al., 2007). Pairwise classification relies on two input examples instead of one and predicts whether the two input examples belong to the same class or to different classes. This is of particular advantage if only a subset of classes is known for training. For later use, a support vector machine (SVM) that is able to handle pairwise classification tasks is called pairwise SVM.

A natural requirement for a pairwise classifier is that the order of the two input examples should not influence the classification result (symmetry). A common approach to enforce this symmetry is the use of selected kernels. For pairwise SVMs, another approach was suggested. Bar-Hillel et al. (2004a) propose the use of training sets with a symmetric structure. We will discuss both approaches to obtain symmetry in a general way. Based on this, we will provide conditions when these approaches lead to the same classifier. Moreover, we show empirically that the approach of using selected kernels is three to four times faster in training.
A typical pairwise classification task arises in face recognition. There, one is often interested in the interclass generalization, where none of the persons in the training set is part of the test set. We will demonstrate that training sets with many classes (persons) are needed to obtain a good performance in the interclass generalization. The training on such sets is computationally expensive. Therefore, we discuss an efficient implementation of pairwise SVMs. This enables the training of pairwise SVMs with several millions of pairs. In this way, for the labeled faces in the wild database a performance is achieved which is superior to the current state of the art.

This paper is structured as follows. In Section 2 we give a short introduction to pairwise classification and discuss the symmetry of decision functions obtained by pairwise SVMs. Afterwards, in Section 3.1, we analyze the symmetry of decision functions from pairwise SVMs that rely on symmetric training sets. The new connection between the two approaches for obtaining symmetry is established in Section 3.2. The efficient implementation of pairwise SVMs is discussed in Section 4. Finally, we provide performance measurements in Section 5.

The main contribution of the paper is that we show the equivalence of two approaches for obtaining a symmetric classifier from pairwise SVMs and demonstrate the efficiency and good interclass generalization performance of pairwise SVMs on large scale problems.

2. Pairwise Classification

Let $X$ be an arbitrary set and let $m$ training examples $x_i \in X$ with $i \in M := \{1, \dots, m\}$ be given.
The class of a training example might be unknown, but we demand that we know for each pair $(x_i, x_j)$ of training examples whether its examples belong to the same class or to different classes. Accordingly, we define $y_{ij} := +1$ if the examples of the pair $(x_i, x_j)$ belong to the same class and call it a positive pair. Otherwise, we set $y_{ij} := -1$ and call $(x_i, x_j)$ a negative pair.

In pairwise classification the aim is to decide whether the examples $a, b$ of a pair $(a, b) \in X \times X$ belong to the same class or not. In this paper, we will make use of pairwise decision functions $f : X \times X \to \mathbb{R}$. Such a function predicts whether the examples $a, b$ of a pair $(a, b)$ belong to the same class ($f(a, b) > 0$) or not ($f(a, b) < 0$). Note that neither $a, b$ need to belong to the set of training examples nor the classes of $a, b$ need to belong to the classes of the training examples.

A common tool in machine learning are kernels $k : X \times X \to \mathbb{R}$. Let $H$ denote an arbitrary real Hilbert space with scalar product $\langle \cdot, \cdot \rangle$. For $\phi : X \to H$,

$$k(s, t) := \langle \phi(s), \phi(t) \rangle$$

defines a standard kernel.

In pairwise classification one often uses pairwise kernels $K : (X \times X) \times (X \times X) \to \mathbb{R}$. In this paper we assume that any pairwise kernel is symmetric, that is, it holds that

$$K((a, b), (c, d)) = K((c, d), (a, b))$$

for all $a, b, c, d \in X$, and that it is positive semidefinite (Schölkopf and Smola, 2001). For instance,

$$K_D((a, b), (c, d)) := k(a, c) + k(b, d), \tag{1}$$
$$K_T((a, b), (c, d)) := k(a, c) \cdot k(b, d) \tag{2}$$

are symmetric and positive semidefinite. We call $K_D$ direct sum pairwise kernel and $K_T$ tensor pairwise kernel (cf. Schölkopf and Smola, 2001).

A natural and desirable property of any pairwise decision function is that it should be symmetric in the following sense

$$f(a, b) = f(b, a) \quad \text{for all } a, b \in X.$$

Now, let us assume that $I \subseteq M \times M$ is given. Then, the pairwise decision function $f$ obtained by a pairwise SVM can be written as

$$f(a, b) := \sum_{(i,j) \in I} \alpha_{ij} y_{ij} K((x_i, x_j), (a, b)) + \gamma \tag{3}$$

with bias $\gamma \in \mathbb{R}$ and $\alpha_{ij} \ge 0$ for all $(i, j) \in I$. Obviously, if $K_D$ (1) or $K_T$ (2) are used, then the decision function is not symmetric in general. This motivates us to call a kernel $K$ balanced if

$$K((a, b), (c, d)) = K((a, b), (d, c)) \quad \text{for all } a, b, c, d \in X$$

holds. Thus, if a balanced kernel is used, then (3) is always a symmetric decision function. For instance, the following kernels are balanced

$$K_{DL}((a, b), (c, d)) := \tfrac{1}{2} \left( k(a, c) + k(a, d) + k(b, c) + k(b, d) \right), \tag{4}$$
$$K_{TL}((a, b), (c, d)) := \tfrac{1}{2} \left( k(a, c) k(b, d) + k(a, d) k(b, c) \right), \tag{5}$$
$$K_{ML}((a, b), (c, d)) := \tfrac{1}{4} \left( k(a, c) - k(a, d) - k(b, c) + k(b, d) \right)^2, \tag{6}$$
$$K_{TM}((a, b), (c, d)) := K_{TL}((a, b), (c, d)) + K_{ML}((a, b), (c, d)). \tag{7}$$

Vert et al. (2007) call $K_{ML}$ metric learning pairwise kernel and $K_{TL}$ tensor learning pairwise kernel. Similarly, we call $K_{DL}$, which was introduced in Bar-Hillel et al. (2004a), direct sum learning pairwise kernel and $K_{TM}$ tensor metric learning pairwise kernel. For representing some balanced kernels by projections see Brunner et al. (2011).

3. Symmetric Pairwise Decision Functions and Pairwise SVMs

Pairwise SVMs lead to decision functions of the form (3). As detailed above, if a balanced kernel is used within a pairwise SVM, one always obtains a symmetric decision function. For pairwise SVMs which use $K_D$ (1) as pairwise kernel, it has been claimed that any symmetric set of training pairs leads to a symmetric decision function (see Bar-Hillel et al., 2004a). We call a set of training pairs symmetric, if for any training pair $(a, b)$ the pair $(b, a)$ also belongs to the training set. In Section 3.1 we prove the claim of Bar-Hillel et al. (2004a) in a more general context which includes $K_T$ (2). Additionally, we show in Section 3.2 that under some conditions a symmetric training set leads to the same decision function as balanced kernels if we disregard the SVM bias term $\gamma$. Interestingly, the application of balanced kernels leads to significantly shorter training times (see Section 4.2).

3.1 Symmetric Training Sets

In this subsection we show that the symmetry of a pairwise decision function is indeed achieved by means of symmetric training sets. To this end, let $I \subseteq M \times M$ be a symmetric index set, in other words if $(i, j)$ belongs to $I$ then $(j, i)$ also belongs to $I$. Furthermore, we will make use of pairwise kernels $K$ with

$$K((a, b), (c, d)) = K((b, a), (d, c)) \quad \text{for all } a, b, c, d \in X. \tag{8}$$

As any pairwise kernel is assumed to be symmetric, (8) holds for any balanced pairwise kernel. Note that there are other pairwise kernels that satisfy (8), for instance the kernels given in Equations (1) and (2).

For $I_R, I_N \subseteq I$ defined by $I_R := \{(i, j) \in I \mid i = j\}$ and $I_N := I \setminus I_R$ let us consider the dual pairwise SVM

$$\begin{aligned} \min_{\alpha} \quad & G(\alpha) \\ \text{s.t.} \quad & 0 \le \alpha_{ij} \le C \quad \text{for all } (i, j) \in I_N, \\ & 0 \le \alpha_{ii} \le 2C \quad \text{for all } (i, i) \in I_R, \\ & \sum_{(i,j) \in I} y_{ij} \alpha_{ij} = 0 \end{aligned} \tag{9}$$

with

$$G(\alpha) := \frac{1}{2} \sum_{(i,j),(k,l) \in I} \alpha_{ij} \alpha_{kl} y_{ij} y_{kl} K((x_i, x_j), (x_k, x_l)) - \sum_{(i,j) \in I} \alpha_{ij}.$$
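The kernel properties used so far can be checked numerically on a handful of points. The following is a minimal sketch, not part of the original implementation: it writes out the pairwise kernels (1), (2), and (4)–(7) for a linear standard kernel and tests the balanced property and property (8) on all 4-tuples of a small point set. The helper names (`k_lin`, `is_balanced`, `satisfies_8`) and the sample points are assumptions made for illustration only.

```python
import itertools

def k_lin(s, t):
    """Linear standard kernel k(s, t) = <s, t> (illustrative choice)."""
    return sum(si * ti for si, ti in zip(s, t))

def K_D(k, a, b, c, d):   # direct sum pairwise kernel, Equation (1)
    return k(a, c) + k(b, d)

def K_T(k, a, b, c, d):   # tensor pairwise kernel, Equation (2)
    return k(a, c) * k(b, d)

def K_DL(k, a, b, c, d):  # direct sum learning pairwise kernel, Equation (4)
    return 0.5 * (k(a, c) + k(a, d) + k(b, c) + k(b, d))

def K_TL(k, a, b, c, d):  # tensor learning pairwise kernel, Equation (5)
    return 0.5 * (k(a, c) * k(b, d) + k(a, d) * k(b, c))

def K_ML(k, a, b, c, d):  # metric learning pairwise kernel, Equation (6)
    return 0.25 * (k(a, c) - k(a, d) - k(b, c) + k(b, d)) ** 2

def K_TM(k, a, b, c, d):  # tensor metric learning pairwise kernel, Equation (7)
    return K_TL(k, a, b, c, d) + K_ML(k, a, b, c, d)

def is_balanced(K, k, points, tol=1e-9):
    """Test K((a,b),(c,d)) == K((a,b),(d,c)) on all 4-tuples of points."""
    return all(abs(K(k, a, b, c, d) - K(k, a, b, d, c)) <= tol
               for a, b, c, d in itertools.product(points, repeat=4))

def satisfies_8(K, k, points, tol=1e-9):
    """Test property (8): K((a,b),(c,d)) == K((b,a),(d,c))."""
    return all(abs(K(k, a, b, c, d) - K(k, b, a, d, c)) <= tol
               for a, b, c, d in itertools.product(points, repeat=4))

# Hypothetical sample points; any set in general position would do.
points = [(1, 0, 2), (0, 1, -1), (2, 1, 0), (-1, 3, 1)]

for name, K in [("K_D", K_D), ("K_T", K_T), ("K_DL", K_DL),
                ("K_TL", K_TL), ("K_ML", K_ML), ("K_TM", K_TM)]:
    print(name, "balanced:", is_balanced(K, k_lin, points),
          "property (8):", satisfies_8(K, k_lin, points))
```

A finite check of this kind can only refute a property, not prove it. Here it reflects the statements of the text: $K_D$ and $K_T$ fail the balanced test on these points while still satisfying (8), and the kernels (4)–(7) pass both.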
Lemma 1 If $I$ is a symmetric index set and if (8) holds, then there is a solution $\hat\alpha$ of (9) with $\hat\alpha_{ij} = \hat\alpha_{ji}$ for all $(i, j) \in I$.

Proof By the theorem of Weierstrass there is a solution $\alpha^*$ of (9). Let us define another feasible point $\tilde\alpha$ of (9) by

$$\tilde\alpha_{ij} := \alpha^*_{ji} \quad \text{for all } (i, j) \in I.$$

For easier notation we set $K_{ij,kl} := K((x_i, x_j), (x_k, x_l))$. Then,

$$2G(\tilde\alpha) = \sum_{(i,j),(k,l) \in I} \alpha^*_{ji} \alpha^*_{lk} y_{ij} y_{kl} K_{ij,kl} - 2 \sum_{(i,j) \in I} \alpha^*_{ji}.$$

Note that $y_{ij} = y_{ji}$ holds for all $(i, j) \in I$. By (8) we further obtain

$$2G(\tilde\alpha) = \sum_{(i,j),(k,l) \in I} \alpha^*_{ji} \alpha^*_{lk} y_{ji} y_{lk} K_{ji,lk} - 2 \sum_{(i,j) \in I} \alpha^*_{ji} = 2G(\alpha^*).$$

The last equality holds since $I$ is a symmetric training set. Hence, $\tilde\alpha$ is also a solution of (9). Since (9) is convex (cf. Schölkopf and Smola, 2001),

$$\alpha^\lambda := \lambda \alpha^* + (1 - \lambda) \tilde\alpha$$

solves (9) for any $\lambda \in [0, 1]$. Thus, $\hat\alpha := \alpha^{1/2}$ has the desired property.

Note that a result similar to Lemma 1 is presented by Wei et al. (2006) for Support Vector Regression. They, however, claim that any solution of the corresponding quadratic program has the described property.

Theorem 2 If $I$ is a symmetric index set and if (8) holds, then any solution $\alpha$ of the optimization problem (9) leads to a symmetric pairwise decision function $f_\alpha : X \times X \to \mathbb{R}$.

Proof For any solution $\alpha$ of (9) let us define $g_\alpha : X \times X \to \mathbb{R}$ by

$$g_\alpha(a, b) := \sum_{(i,j) \in I} \alpha_{ij} y_{ij} K((x_i, x_j), (a, b)).$$

Then, the obtained decision function can be written as $f_\alpha(a, b) = g_\alpha(a, b) + \gamma$ for some appropriate $\gamma \in \mathbb{R}$. If $\alpha^1$ and $\alpha^2$ are solutions of (9) then $g_{\alpha^1} = g_{\alpha^2}$ can be derived by means of convex optimization theory. According to Lemma 1 there is always a solution $\hat\alpha$ of (9) with $\hat\alpha_{ij} = \hat\alpha_{ji}$ for all $(i, j) \in I$. Obviously, such a solution leads to a symmetric decision function $f_{\hat\alpha}$. Indeed, (8), $y_{ij} = y_{ji}$, $\hat\alpha_{ij} = \hat\alpha_{ji}$, and the symmetry of $I$ give $g_{\hat\alpha}(b, a) = \sum_{(i,j) \in I} \hat\alpha_{ij} y_{ij} K((x_j, x_i), (a, b)) = g_{\hat\alpha}(a, b)$. Hence, $f_\alpha$ is a symmetric decision function for all solutions $\alpha$.

3.2 Balanced Kernels vs. Symmetric Training Sets

Section 2 shows that one can use balanced kernels to obtain a symmetric pairwise decision function by means of a pairwise SVM. As detailed in Section 3.1 this can also be achieved by symmetric training sets. Now, we show in Theorem 3 that the decision function is the same, regardless whether a symmetric training set or a certain balanced kernel is used. This result is also of practical value, since the approach with balanced kernels leads to significantly shorter training times (see the empirical results in Section 4.2).

Suppose $J$ is a largest subset of a given symmetric index set $I$ satisfying

$$((i, j) \in J \wedge j \ne i) \Rightarrow (j, i) \notin J.$$

Now, we consider the optimization problem

$$\begin{aligned} \min_{\beta} \quad & H(\beta) \\ \text{s.t.} \quad & 0 \le \beta_{ij} \le 2C \quad \text{for all } (i, j) \in J, \\ & \sum_{(i,j) \in J} y_{ij} \beta_{ij} = 0 \end{aligned} \tag{10}$$

with

$$H(\beta) := \frac{1}{2} \sum_{(i,j),(k,l) \in J} \beta_{ij} \beta_{kl} y_{ij} y_{kl} \hat K_{ij,kl} - \sum_{(i,j) \in J} \beta_{ij}$$

and

$$\hat K_{ij,kl} := \frac{1}{2} \left( K_{ij,kl} + K_{ji,kl} \right), \tag{11}$$

where $K$ is an arbitrary pairwise kernel. Obviously, $\hat K$ is a balanced kernel. For instance, if $K = K_D$ (1) then $\hat K = K_{DL}$ (4) or if $K = K_T$ (2) then $\hat K = K_{TL}$ (5). The assumed symmetry of $K$ yields

$$\hat K_{ij,kl} = \hat K_{ij,lk} = \hat K_{ji,kl} = \hat K_{ji,lk} = \hat K_{kl,ij} = \hat K_{lk,ij} = \hat K_{kl,ji} = \hat K_{lk,ji}. \tag{12}$$

Note that (12) holds not only for kernels given by (11) but for any balanced kernel.

Theorem 3 Let the functions $g_\alpha : X \times X \to \mathbb{R}$ and $h_\beta : X \times X \to \mathbb{R}$ be defined by

$$g_\alpha(a, b) := \sum_{(i,j) \in I} \alpha_{ij} y_{ij} K((x_i, x_j), (a, b)),$$
$$h_\beta(a, b) := \sum_{(i,j) \in J} \beta_{ij} y_{ij} \hat K((x_i, x_j), (a, b)),$$

where $I$ is a symmetric index set and $J$ is defined as above. Additionally, let $K$ fulfill (8) and $\hat K$ be given by (11). Then, for any solution $\alpha^*$ of (9) and for any solution $\beta^*$ of (10) it holds that $g_{\alpha^*} = h_{\beta^*}$.

Proof By means of convex optimization theory it can be derived that $g_\alpha$ is the same function for any solution $\alpha$. The same holds for $h_\beta$ and any solution $\beta$. Hence, due to Lemma 1 we can assume that $\alpha^*$ is a solution of (9) with $\alpha^*_{ij} = \alpha^*_{ji}$. For $J_R := I_R$ and $J_N := J \setminus J_R$ we define $\bar\beta$ by

$$\bar\beta_{ij} := \begin{cases} \alpha^*_{ij} + \alpha^*_{ji} & \text{if } (i, j) \in J_N, \\ \alpha^*_{ii} & \text{if } (i, j) \in J_R. \end{cases}$$

Obviously, $\bar\beta$ is a feasible point of (10). Then, by (11) and $\alpha^*_{ij} = \alpha^*_{ji}$ we obtain for

$$\begin{aligned} (i, j) \in J_N : \quad & \bar\beta_{ij} \hat K_{ij,kl} = \frac{\bar\beta_{ij}}{2} \left( K_{ij,kl} + K_{ji,kl} \right) = \frac{\alpha^*_{ij} + \alpha^*_{ji}}{2} \left( K_{ij,kl} + K_{ji,kl} \right) = \alpha^*_{ij} K_{ij,kl} + \alpha^*_{ji} K_{ji,kl}, \\ (i, i) \in J_R : \quad & \bar\beta_{ii} \hat K_{ii,kl} = \frac{\bar\beta_{ii}}{2} \left( K_{ii,kl} + K_{ii,kl} \right) = \alpha^*_{ii} K_{ii,kl}. \end{aligned} \tag{13}$$

Then, $y_{ij} = y_{ji}$ implies

$$h_{\bar\beta} = g_{\alpha^*}. \tag{14}$$

In a second step we prove that $\bar\beta$ is a solution of problem (10). By using $y_{kl} = y_{lk}$, the symmetry of $K$, (13), (12), and the definition of $\bar\beta$ one obtains

$$\begin{aligned} 2G(\alpha^*) + 2 \sum_{(i,j) \in I} \alpha^*_{ij} &= \sum_{(i,j) \in I} \alpha^*_{ij} y_{ij} \left( \sum_{(k,l) \in J_N} y_{kl} \left( \alpha^*_{kl} K_{ij,kl} + \alpha^*_{lk} K_{ij,lk} \right) + \sum_{(k,k) \in J_R} y_{kk} \alpha^*_{kk} K_{ij,kk} \right) \\ &= \sum_{(i,j) \in J_N} \alpha^*_{ij} y_{ij} \sum_{(k,l) \in J} \bar\beta_{kl} y_{kl} \hat K_{ij,kl} + \sum_{(i,j) \in J_N} \alpha^*_{ji} y_{ji} \sum_{(k,l) \in J} \bar\beta_{kl} y_{kl} \hat K_{ji,kl} + \sum_{(i,i) \in J_R} \bar\beta_{ii} y_{ii} \sum_{(k,l) \in J} \bar\beta_{kl} y_{kl} \hat K_{ii,kl} \\ &= \sum_{(i,j) \in J_N \cup J_R} \bar\beta_{ij} y_{ij} \sum_{(k,l) \in J} \bar\beta_{kl} y_{kl} \hat K_{ij,kl} \\ &= 2H(\bar\beta) + 2 \sum_{(i,j) \in J} \bar\beta_{ij}. \end{aligned}$$

Then, the definition of $\bar\beta$ implies

$$G(\alpha^*) = H(\bar\beta). \tag{15}$$

Now, let us define $\bar\alpha$ by

$$\bar\alpha_{ij} := \begin{cases} \beta^*_{ij} / 2 & \text{if } (i, j) \in J_N, \\ \beta^*_{ji} / 2 & \text{if } (j, i) \in J_N, \\ \beta^*_{ii} & \text{if } (i, j) \in J_R. \end{cases}$$

Obviously, $\bar\alpha$ is a feasible point of (9). Then, by (8) and (11) we obtain for

$$\begin{aligned} (k, l) \in J_N : \quad & \bar\alpha_{kl} K_{ij,kl} + \bar\alpha_{lk} K_{ij,lk} = \frac{\beta^*_{kl}}{2} \left( K_{ij,kl} + K_{ij,lk} \right) = \beta^*_{kl} \hat K_{ij,kl}, \\ (k, k) \in J_R : \quad & \bar\alpha_{kk} K_{ij,kk} = \frac{\beta^*_{kk}}{2} \left( K_{ij,kk} + K_{ij,kk} \right) = \beta^*_{kk} \hat K_{ij,kk}. \end{aligned}$$
This, (12), and $y_{kl} = y_{lk}$ yield

$$\begin{aligned} 2H(\beta^*) + 2 \sum_{(i,j) \in J} \beta^*_{ij} &= \sum_{(i,j) \in J} \beta^*_{ij} y_{ij} \left( \sum_{(k,l) \in J_N} \frac{1}{2} \beta^*_{kl} y_{kl} \left( \hat K_{ij,kl} + \hat K_{ji,kl} \right) + \sum_{(k,k) \in J_R} \frac{1}{2} \beta^*_{kk} y_{kk} \left( \hat K_{ij,kk} + \hat K_{ji,kk} \right) \right) \\ &= \frac{1}{2} \sum_{(i,j) \in J} \beta^*_{ij} y_{ij} \left( \sum_{(k,l) \in I} \bar\alpha_{kl} y_{kl} \left( K_{ij,kl} + K_{ji,kl} \right) \right). \end{aligned}$$

Then, the definition of $\bar\alpha$ provides $\beta^*_{ij} = \bar\alpha_{ij} + \bar\alpha_{ji}$ for $(i, j) \in J_N$ and $\bar\alpha_{ij} = \bar\alpha_{ji}$. Thus,

$$2H(\beta^*) + 2 \sum_{(i,j) \in J} \beta^*_{ij} = \sum_{(i,j) \in I} \bar\alpha_{ij} y_{ij} \left( \sum_{(k,l) \in I} \bar\alpha_{kl} y_{kl} K_{ij,kl} \right) = 2G(\bar\alpha) + 2 \sum_{(i,j) \in I} \bar\alpha_{ij}$$

follows. This implies $G(\bar\alpha) = H(\beta^*)$. Now, let us assume that $\bar\beta$ is not a solution of (10). Then, $H(\beta^*) < H(\bar\beta)$ holds and, by (15), we have

$$G(\alpha^*) = H(\bar\beta) > H(\beta^*) = G(\bar\alpha).$$

This is a contradiction to the optimality of $\alpha^*$. Hence, $\bar\beta$ is a solution of (10) and $h_{\beta^*} = h_{\bar\beta}$ follows. Then, with (14) we have the desired result.

4. Implementation

One of the most widely used techniques for solving SVMs efficiently is the sequential minimal optimization (SMO) (Platt, 1999). A well known implementation of this technique is LIBSVM (Chang and Lin, 2011). Empirically, SMO scales quadratically with the number of training points (Platt, 1999). Note that in pairwise classification the training points are the training pairs. If all possible training pairs are used, then the number of training pairs grows quadratically with the number $m$ of training examples. Hence, the runtime of LIBSVM would scale quartically with $m$. In Section 4.1 we discuss how the costs for evaluating pairwise kernels, which can be expressed by standard kernels, can be drastically reduced. In Section 3 we discussed that one can either use balanced kernels or symmetric training sets to enforce the symmetry of a pairwise decision function.
Additionally, we showed that both approaches lead to the same decision function. Section 4.2 compares the needed training times of the approach with balanced kernels and the approach with symmetric training sets.

4.1 Caching the Standard Kernel

In this subsection balanced kernels are used to enforce the symmetry of the pairwise decision function. Kernel evaluations are crucial for the performance of LIBSVM. If we could cache the whole kernel matrix in RAM we would get a huge increase of speed. Today, this seems impossible for significantly more than 125,250 training pairs as storing the (symmetric) kernel matrix for this number of pairs in double precision needs approximately 59GB. Note that training sets with 500 training examples already result in 125,250 training pairs. Now, we describe how the costs of kernel evaluations can be drastically reduced. For example, let us select the kernel $K_{TL}$ (5) with an arbitrary standard kernel. For a single evaluation of $K_{TL}$ the standard kernel has to be evaluated four times with vectors of $X$. Afterwards, four arithmetic operations are needed. It is easy to see that each standard kernel value is used for evaluating many different elements of the kernel matrix. In general, it is possible to cache the standard kernel values for all training examples. For example, to cache the standard kernel values for 10,000 examples one needs 400MB. Thus, each kernel evaluation of $K_{TL}$ costs four arithmetic operations only. This does not depend on the chosen standard kernel.

Table 1 compares the training times with and without caching the standard kernel values. For these measurements examples from the double interval task (cf. Section 5.1) are used where each class is represented by 5 examples, $K_{TL}$ is chosen as pairwise kernel with a linear standard kernel, a cache size of 100MB is selected for caching pairwise kernel values, and all possible pairs are used for training. In Table 1a the training set of each run consists of $m = 250$ examples of 50 classes with different dimensions $n$. Table 1b shows results for different numbers $m$ of examples of dimension $n = 500$. The speed up factor by the described caching technique is up to 100.

(a) Different dimensions n of examples

Dimension n      Standard kernel (time in mm:ss)
of examples      not cached      cached
200              2:08            0:07
400              4:31            0:07
600              6:24            0:07
800              9:41            0:08
1000             11:27           0:09

(b) Different numbers m of examples

Number m         Standard kernel (time in hh:mm)
of examples      not cached      cached
200              0:04            0:00
400              1:05            0:01
600              4:17            0:02
800              12:40           0:06
1000             28:43           0:13

Table 1: Training time with and without caching the standard kernel

4.2 Balanced Kernels vs. Symmetric Training Sets

Theorem 3 shows that pairwise SVMs which use symmetric training sets and pairwise SVMs with balanced kernels lead to the same decision function. For symmetric training sets the number of training pairs is nearly doubled compared to the number in the case of balanced kernels. Simultaneously, (11) shows that evaluating a balanced kernel is computationally more expensive compared to the corresponding non balanced kernel.

Table 2 compares the needed training time of both approaches. There, examples from the double interval task (cf. Section 5.1) of dimension $n = 500$ are used where each class is represented by 5 examples, $K_T$ and its balanced version $K_{TL}$ with linear standard kernels are chosen as pairwise kernel, a cache size of 100MB is selected for caching the pairwise kernel values, and all possible pairs are used for training. It turns out that the approach with balanced kernels is three to four times faster than using symmetric training sets. Of course, the technique of caching the standard kernel values as described in Section 4.1 is used within all measurements.

Number m         Symmetric training set      Balanced kernel
of examples      (t in hh:mm)                (t in hh:mm)
500              0:03                        0:01
1000             0:46                        0:17
1500             3:26                        0:56
2000             9:44                        2:58
2500             23:15                       6:20

Table 2: Training time for symmetric training sets and for balanced kernels

5. Classification Experiments

In this section we will present results of applying pairwise SVMs to one synthetic data set and to one real world data set. Before we come to those data sets in Sections 5.1 and 5.2 we introduce $K_{TL}^{lin}$ and $K_{TL}^{poly}$. Those kernels denote $K_{TL}$ (5) with linear standard kernel and homogeneous polynomial standard kernel of degree two, respectively. The kernels $K_{ML}^{lin}$, $K_{ML}^{poly}$, $K_{TM}^{lin}$, and $K_{TM}^{poly}$ are defined analogously. In the following, detection error trade-off curves (DET curves, cf. Gamassi et al., 2004) will be used to measure the performance of a pairwise classifier. Such a curve shows for any false match rate (FMR) the corresponding false non match rate (FNMR). A special point of interest of such a curve is the (approximated) equal error rate (EER), that is the value for which FMR = FNMR holds.

5.1 Double Interval Task

Let us describe the double interval task of dimension $n$.
To get such an example $x \in \{-1, 1\}^n$ one draws $i, j, k, l \in \mathbb{N}$ so that $2 \le i \le j$, $j + 2 \le k \le l \le n$ and defines

$$x_p := \begin{cases} 1 & p \in \{i, \dots, j\} \cup \{k, \dots, l\}, \\ -1 & \text{otherwise.} \end{cases}$$

The class $c$ of such an example is given by $c(x) := (i, k)$. Note that the pair $(j, l)$ does not influence the class. Hence, there are $(n - 3)(n - 2)/2$ classes.

[Figure 1: DET curves (FNMR versus FMR) for double interval task. (a) Different class numbers in training: 50, 100, and 200 classes for $K_{ML}^{lin}$ and $K_{TM}^{poly}$. (b) Different kernels for 200 classes in training: $K_{ML}^{lin}$, $K_{ML}^{poly}$, $K_{TL}^{lin}$, $K_{TL}^{poly}$, $K_{TM}^{lin}$, $K_{TM}^{poly}$.]

For our measurements we selected $n = 500$ and tested all kernels in (4)–(7) with a linear standard kernel and a homogeneous polynomial standard kernel of degree two, respectively. We created a test set consisting of 750 examples of 50 classes so that each class is represented by 15 examples. Any training set was generated in such a way that the set of classes in the training set is disjoint from the set of classes in the test set. We created training sets consisting of 50 classes and different numbers of examples per class. For training all possible training pairs were used. We observed that an increasing number of examples per class improves the performance independently of the other parameters. As a trade-off between the needed training time and performance of the classifier, we decided to use 15 examples per class for the measurements. Independently of the selected kernel, a penalty parameter $C$ of 1,000 turned out to be a good choice.
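The definition of the double interval task can be turned into a small generator. The sketch below is an illustration under stated assumptions, not the procedure used for the measurements: the text does not say how $i, j, k, l$ are distributed, so uniform draws are assumed here, and the function names (`sample_double_interval`, `class_of`, `num_classes`) are hypothetical.

```python
import random

def sample_double_interval(n, rng, cls=None):
    """Draw one example x in {-1, 1}^n of the double interval task.

    Indices are 1-based as in the text: 2 <= i <= j and j + 2 <= k <= l <= n.
    If cls = (i, k) is given, only j and l are drawn, so the example belongs
    to that class. Uniform sampling is an assumption of this sketch.
    """
    if cls is None:
        i = rng.randint(2, n - 2)   # leaves room for k >= i + 2
        k = rng.randint(i + 2, n)
    else:
        i, k = cls
    j = rng.randint(i, k - 2)       # keeps at least one -1 between the runs
    l = rng.randint(k, n)
    x = [1 if (i <= p <= j or k <= p <= l) else -1 for p in range(1, n + 1)]
    return x, (i, k)

def class_of(x):
    """Recover the class (i, k) as the start indices of the two runs of +1."""
    i = x.index(1) + 1              # 1-based start of the first run
    gap = x.index(-1, i)            # first -1 after the first run (0-based)
    k = x.index(1, gap) + 1         # 1-based start of the second run
    return i, k

def num_classes(n):
    """Number of classes (n - 3)(n - 2) / 2 stated in the text."""
    return (n - 3) * (n - 2) // 2

rng = random.Random(0)
x, c = sample_double_interval(20, rng)
print(c, class_of(x), num_classes(500))
```

Passing `cls` makes it possible to draw several examples per class, for instance 15 per class as in the measurements, and to keep the classes of a training set disjoint from those of a test set.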
The kernel $K_{DS}$ led to a bad performance regardless of the standard kernel chosen. Therefore, we omit results for $K_{DS}$.

Figure 1a shows that an increasing number of classes in the training set improves the performance significantly. This holds for all kernels mentioned above. Here, we only present results for $K_{ML}^{lin}$ and $K_{TM}^{poly}$. Figure 1b shows the DET curves for different kernels where the training set consists of 200 classes. In particular, any of the pairwise kernels which uses a homogeneous polynomial of degree 2 as standard kernel leads to better results than its corresponding counterpart with a linear standard kernel. For FMRs smaller than 0.07 $K_{TM}^{poly}$ leads to the best results, whereas for larger FMRs the DET curves of $K_{ML}^{poly}$, $K_{TL}^{poly}$, and $K_{TM}^{poly}$ intersect.

5.2 Labeled Faces in the Wild

In this subsection we will present results of applying pairwise SVMs to the labeled faces in the wild (LFW) data set (Huang et al., 2007). This data set consists of 13,233 images of 5,749 persons. Several remarks on this data set are in order. Huang et al. (2007) suggest two protocols for performance measurements. Here, the unrestricted protocol is used. This protocol is a fixed tenfold cross validation where each test set consists of 300 positive pairs and 300 negative pairs. Moreover, any person (class) in a training set is not part of the corresponding test set.

There are several feature vectors available for the LFW data set. For the presented measurements we mainly followed Li et al. (2012) and used the scale-invariant feature transform (SIFT)-based feature vectors for the funneled version (Guillaumin et al., 2009) of LFW. In addition, the aligned images (Wolf et al., 2009) are used. For this, the aligned images are cropped to 80 × 150 pixels and are then normalized by passing them through a log function (cf. Li et al., 2012). Afterwards, the local binary patterns (LBP) (Ojala et al., 2002) and three-patch LBP (TPLBP) (Wolf et al., 2008) are extracted. In contrast to Li et al. (2012), the pose is neither estimated nor swapped and no PCA is applied to the data. As the norm of the LBP feature vectors is not the same for all images we scaled them to Euclidean norm 1.

[Figure 2: DET curves (FNMR versus FMR) for LFW data set. (a) View 1 partition, different kernels ($K_{ML}^{lin}$, $K_{ML}^{poly}$, $K_{TL}^{lin}$, $K_{TL}^{poly}$, $K_{TM}^{lin}$, $K_{TM}^{poly}$), added up decision function values of SIFT, LBP, and TPLBP feature vectors. (b) Unrestricted protocol, $K_{TM}^{poly}$, different feature vectors (SIFT, LBP, TPLBP, LBP+TPLBP, SIFT+LBP+TPLBP); "+" stands for adding up the corresponding decision function values.]

For model selection, the View 1 partition of the LFW database is recommended (Huang et al., 2007). Using all possible pairs of this partition for training and for testing, we obtained that a penalty parameter $C$ of 1,000 is suitable. Moreover, for each used feature vector, the kernel $K_{TM}^{poly}$ leads to the best results among all used kernels and also if sums of decision function values belonging to SIFT, LBP, and TPLBP feature vectors are used. For example, Figure 2a shows the performance of different kernels, where the decision function values corresponding to SIFT, LBP, and TPLBP feature vectors are added up.

Due to the speed up techniques presented in Section 4 we were able to train with large numbers of training pairs.
However, if all pairs were used for training, then any training set would consist of approximately 50,000,000 pairs and the training would still need too much time. Hence, whereas in any training set all positive training pairs were used, the negative training pairs were randomly selected in such a way that any training set consists of 2,000,000 pairs. The training of such a model took less than 24 hours on a standard PC. In Figure 2b we present the average DET curves obtained for $K_{TM}^{poly}$ and feature vectors based on SIFT, LBP, and TPLBP. Inspired by Li et al. (2012), we determined two further DET curves by adding up the decision function values. This led to very good results. Furthermore, we concatenated the SIFT, LBP, and TPLBP feature vectors. Surprisingly, the training of some of those models needed longer than a week. Therefore, we do not present these results.

In Table 3 the mean equal error rate (EER) and the standard error of the mean (SEM) obtained from the tenfold cross validation are provided for several types of feature vectors. Note that many of our results are comparable to the state of the art or even better. The current state of the art can be found on the homepage of Huang et al. (2007) and in the publication of Li et al. (2012). If only SIFT-based feature vectors are used, then the best known result is 0.125 ± 0.0040 (EER ± SEM). With pairwise SVMs we achieved the same EER but a slightly higher SEM 0.1252 ± 0.0062. If we add up the decision function values corresponding to the LBP and TPLBP feature vectors, then our result 0.1210 ± 0.0046 is worse compared to the state of the art 0.1050 ± 0.0051. One possible reason for this fact might be that we did not swap the pose.
Finally, for the added up decision function values corresponding to SIFT, LBP, and TPLBP feature vectors, our performance 0.0947 ± 0.0057 is better than 0.0993 ± 0.0051. Furthermore, it is worth noting that our standard errors of the mean are comparable to those of the other presented learning algorithms although most of them use a PCA to reduce noise and dimension of the feature vectors. Note that the results of the commercial system are not directly comparable since it uses outside training data (for reference see Huang et al., 2007).

                            SIFT     LBP      TPLBP    L+T      S+L+T    CS
Pairwise SVM      Mean EER  0.1252   0.1497   0.1452   0.1210   0.0947   -
                  SEM       0.0062   0.0052   0.0060   0.0046   0.0057   -
State of the Art  Mean EER  0.1250   0.1267   0.1630   0.1050   0.0993   0.0870
                  SEM       0.0040   0.0055   0.0070   0.0051   0.0051   0.0030

Table 3: Mean EER and SEM for LFW data set. S=SIFT, L=LBP, T=TPLBP, +=adding up decision function values, CS=Commercial system face.com r2011b

6. Final Remarks

In this paper we suggested the SVM framework for handling large pairwise classification problems. We analyzed two approaches to enforce the symmetry of the obtained classifiers. To the best of our knowledge, we gave the first proof that symmetry is indeed achieved. Then, we proved that for each parameter set of one approach there is a corresponding parameter set of the other one such that both approaches lead to the same classifier. Additionally, we showed that the approach based on balanced kernels leads to shorter training times. We discussed details of the implementation of a pairwise SVM solver and presented numerical results. Those results demonstrate that pairwise SVMs are capable of successfully treating large scale pairwise classification problems. Furthermore, we showed that pairwise SVMs compete very well for a real world data set.
We would like to underline that some of the discussed techniques could be transferred to other approaches for solving pairwise classification problems. For example, most of the results can be applied easily to One Class Support Vector Machines (Schölkopf et al., 2001).

Acknowledgments

We would like to thank the unknown referees for their valuable comments and suggestions.

References

J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.

A. Bar-Hillel and D. Weinshall. Learning distance function by coding similarity. In Z. Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning (ICML '07), pages 65–72. ACM, 2007.

A. Bar-Hillel, T. Hertz, and D. Weinshall. Boosting margin based distance functions for clustering. In C. E. Brodley, editor, Proceedings of the 21st International Conference on Machine Learning (ICML '04), pages 393–400. ACM, 2004a.

A. Bar-Hillel, T. Hertz, and D. Weinshall. Learning distance functions for image retrieval. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), volume 2, pages 570–577. IEEE Computer Society Press, 2004b.

A. Ben-Hur and W. Stafford Noble. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21(1):38–46, 2005.

C. Brunner, A. Fischer, K. Luig, and T. Thies. Pairwise kernels, support vector machines, and the application to large scale problems. Technical Report MATH-NM-04-2011, Institute of Numerical Mathematics, Technische Universität Dresden, October 2011. URL http://www.math.tu-dresden.de/~fischer.

C.-C.
Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–26, 2011. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm (August 2011).

K. Duan and S. S. Keerthi. Which is the best multiclass SVM method? An empirical study. In N. C. Oza, R. Polikar, J. Kittler, and F. Roli, editors, Proceedings of the 6th International Workshop on Multiple Classifier Systems, pages 278–285. Springer, 2005.

M. Gamassi, M. Lazzaroni, M. Misino, V. Piuri, D. Sana, and F. Scotti. Accuracy and performance of biometric systems. In Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference (IMTC '04), pages 510–515. IEEE, 2004.

M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pages 498–505, 2009. URL http://lear.inrialpes.fr/pubs/2009/GVS09 (August 2011).

S. I. Hill and A. Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30(1):525–564, 2007.

C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007. URL http://vis-www.cs.umass.edu/lfw/ (August 2011).

P. Li, Y. Fu, U. Mohammed, J. H. Elder, and S. J. D. Prince. Probabilistic models for inference about identity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:144–157, 2012.

T.
Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002. URL http://www.cse.oulu.fi/MVG/Downloads/LBPMatlab (August 2011).

P. J. Phillips. Support vector machines applied to face recognition. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 803–809. MIT Press, 1999.

J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185–208. MIT Press, 1999.

R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.

B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

J. P. Vert, J. Qiu, and W. Noble. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, 8(Suppl 10):S8, 2007.

L. Wei, Y. Yang, R. M. Nishikawa, and M. N. Wernick. Learning of perceptual similarity from expert readers for mammogram retrieval. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), pages 1356–1359. IEEE, 2006.

L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In Faces in Real-Life Images Workshop at the European Conference on Computer Vision (ECCV '08), 2008. URL
http://www.openu.ac.il/home/hassner/projects/Patchlbp (August 2011).

L. Wolf, T. Hassner, and Y. Taigman. Similarity scores based on background samples. In Proceedings of the 9th Asian Conference on Computer Vision (ACCV '09), volume 2, pages 88–97, 2009.