Pairwise Support Vector Machines and their Application to Large Scale Problems

Journal of Machine Learning Research 13 (2012) 2279-2292    Submitted 8/11; Revised 3/12; Published 8/12

Carl Brunner    C.BRUNNER@GMX.NET
Andreas Fischer    ANDREAS.FISCHER@TU-DRESDEN.DE
Institute for Numerical Mathematics
Technische Universität Dresden
01062 Dresden, Germany

Klaus Luig    LUIG@COGNITEC.COM
Thorsten Thies    THIES@COGNITEC.COM
Cognitec Systems GmbH
Grossenhainer Str. 101
01127 Dresden, Germany

Editor: Corinna Cortes

© 2012 Carl Brunner, Andreas Fischer, Klaus Luig and Thorsten Thies.

Abstract

Pairwise classification is the task to predict whether the examples $a, b$ of a pair $(a, b)$ belong to the same class or to different classes. In particular, interclass generalization problems can be treated in this way. In pairwise classification, the order of the two input examples should not affect the classification result. To achieve this, particular kernels as well as the use of symmetric training sets in the framework of support vector machines were suggested. The paper discusses both approaches in a general way and establishes a strong connection between them. In addition, an efficient implementation is discussed which allows the training of several millions of pairs. The value of these contributions is confirmed by excellent results on the labeled faces in the wild benchmark.

Keywords: pairwise support vector machines, interclass generalization, pairwise kernels, large scale problems

1. Introduction

To extend binary classifiers to multiclass classification several modifications have been suggested, for example the one against all technique, the one against one technique, or directed acyclic graphs, see Duan and Keerthi (2005), Hill and Doucet (2007), Hsu and Lin (2002), and Rifkin and Klautau (2004) for further information, discussions, and comparisons. A more recent approach used in the field of multiclass and binary classification is pairwise classification (Abernethy et al., 2009; Bar-Hillel et al., 2004a,b; Bar-Hillel and Weinshall, 2007; Ben-Hur and Noble, 2005; Phillips, 1999; Vert et al., 2007). Pairwise classification relies on two input examples instead of one and predicts whether the two input examples belong to the same class or to different classes. This is of particular advantage if only a subset of classes is known for training. For later use, a support vector machine (SVM) that is able to handle pairwise classification tasks is called pairwise SVM.

A natural requirement for a pairwise classifier is that the order of the two input examples should not influence the classification result (symmetry). A common approach to enforce this symmetry is the use of selected kernels. For pairwise SVMs, another approach was suggested. Bar-Hillel et al. (2004a) propose the use of training sets with a symmetric structure. We will discuss both approaches to obtain symmetry in a general way. Based on this, we will provide conditions when these approaches lead to the same classifier. Moreover, we show empirically that the approach of using selected kernels is three to four times faster in training.
A typical pairwise classification task arises in face recognition. There, one is often interested in the interclass generalization, where none of the persons in the training set is part of the test set. We will demonstrate that training sets with many classes (persons) are needed to obtain a good performance in the interclass generalization. The training on such sets is computationally expensive. Therefore, we discuss an efficient implementation of pairwise SVMs. This enables the training of pairwise SVMs with several millions of pairs. In this way, for the labeled faces in the wild database a performance is achieved which is superior to the current state of the art.

This paper is structured as follows. In Section 2 we give a short introduction to pairwise classification and discuss the symmetry of decision functions obtained by pairwise SVMs. Afterwards, in Section 3.1, we analyze the symmetry of decision functions from pairwise SVMs that rely on symmetric training sets. The new connection between the two approaches for obtaining symmetry is established in Section 3.2. The efficient implementation of pairwise SVMs is discussed in Section 4. Finally, we provide performance measurements in Section 5.

The main contribution of the paper is that we show the equivalence of two approaches for obtaining a symmetric classifier from pairwise SVMs and demonstrate the efficiency and good interclass generalization performance of pairwise SVMs on large scale problems.

2. Pairwise Classification

Let $X$ be an arbitrary set and let $m$ training examples $x_i \in X$ with $i \in M := \{1, \ldots, m\}$ be given.
The class of a training example might be unknown, but we demand that we know for each pair $(x_i, x_j)$ of training examples whether its examples belong to the same class or to different classes. Accordingly, we define $y_{ij} := +1$ if the examples of the pair $(x_i, x_j)$ belong to the same class and call it a positive pair. Otherwise, we set $y_{ij} := -1$ and call $(x_i, x_j)$ a negative pair.

In pairwise classification the aim is to decide whether the examples $a, b$ of a pair $(a, b) \in X \times X$ belong to the same class or not. In this paper, we will make use of pairwise decision functions $f : X \times X \to \mathbb{R}$. Such a function predicts whether the examples $a, b$ of a pair $(a, b)$ belong to the same class ($f(a, b) > 0$) or not ($f(a, b) < 0$). Note that $a, b$ need not belong to the set of training examples, and the classes of $a, b$ need not be among the classes of the training examples.

A common tool in machine learning are kernels $k : X \times X \to \mathbb{R}$. Let $H$ denote an arbitrary real Hilbert space with scalar product $\langle \cdot, \cdot \rangle$. For $\varphi : X \to H$,

$$k(s, t) := \langle \varphi(s), \varphi(t) \rangle$$

defines a standard kernel.

In pairwise classification one often uses pairwise kernels $K : (X \times X) \times (X \times X) \to \mathbb{R}$. In this paper we assume that any pairwise kernel is symmetric, that is, it holds that

$$K((a, b), (c, d)) = K((c, d), (a, b)) \quad \text{for all } a, b, c, d \in X,$$

and that it is positive semidefinite (Schölkopf and Smola, 2001). For instance,

$$K_D((a, b), (c, d)) := k(a, c) + k(b, d), \tag{1}$$
$$K_T((a, b), (c, d)) := k(a, c) \cdot k(b, d) \tag{2}$$

are symmetric and positive semidefinite. We call $K_D$ direct sum pairwise kernel and $K_T$ tensor pairwise kernel (cf. Schölkopf and Smola, 2001).

A natural and desirable property of any pairwise decision function is that it should be symmetric in the following sense

$$f(a, b) = f(b, a) \quad \text{for all } a, b \in X.$$

Now, let us assume that $I \subseteq M \times M$ is given. Then, the pairwise decision function $f$ obtained by a pairwise SVM can be written as

$$f(a, b) := \sum_{(i,j) \in I} \alpha_{ij} y_{ij} K((x_i, x_j), (a, b)) + \gamma \tag{3}$$

with bias $\gamma \in \mathbb{R}$ and $\alpha_{ij} \ge 0$ for all $(i, j) \in I$. Obviously, if $K_D$ (1) or $K_T$ (2) are used, then the decision function is not symmetric in general. This motivates us to call a kernel $K$ balanced if

$$K((a, b), (c, d)) = K((a, b), (d, c)) \quad \text{for all } a, b, c, d \in X$$

holds. Thus, if a balanced kernel is used, then (3) is always a symmetric decision function. For instance, the following kernels are balanced

$$K_{DL}((a, b), (c, d)) := \tfrac{1}{2} \left( k(a, c) + k(a, d) + k(b, c) + k(b, d) \right), \tag{4}$$
$$K_{TL}((a, b), (c, d)) := \tfrac{1}{2} \left( k(a, c) k(b, d) + k(a, d) k(b, c) \right), \tag{5}$$
$$K_{ML}((a, b), (c, d)) := \tfrac{1}{4} \left( k(a, c) - k(a, d) - k(b, c) + k(b, d) \right)^2, \tag{6}$$
$$K_{TM}((a, b), (c, d)) := K_{TL}((a, b), (c, d)) + K_{ML}((a, b), (c, d)). \tag{7}$$

Vert et al. (2007) call $K_{ML}$ metric learning pairwise kernel and $K_{TL}$ tensor learning pairwise kernel. Similarly, we call $K_{DL}$, which was introduced in Bar-Hillel et al. (2004a), direct sum learning pairwise kernel and $K_{TM}$ tensor metric learning pairwise kernel. For representing some balanced kernels by projections see Brunner et al. (2011).

3. Symmetric Pairwise Decision Functions and Pairwise SVMs

Pairwise SVMs lead to decision functions of the form (3). As detailed above, if a balanced kernel is used within a pairwise SVM, one always obtains a symmetric decision function. For pairwise SVMs which use $K_D$ (1) as pairwise kernel, it has been claimed that any symmetric set of training pairs leads to a symmetric decision function (see Bar-Hillel et al., 2004a). We call a set of training pairs symmetric, if for any training pair $(a, b)$ the pair $(b, a)$ also belongs to the training set. In Section 3.1 we prove the claim of Bar-Hillel et al. (2004a) in a more general context which includes $K_T$ (2). Additionally, we show in Section 3.2 that under some conditions a symmetric training set leads to the same decision function as balanced kernels if we disregard the SVM bias term $\gamma$. Interestingly, the application of balanced kernels leads to significantly shorter training times (see Section 4.2).

3.1 Symmetric Training Sets

In this subsection we show that the symmetry of a pairwise decision function is indeed achieved by means of symmetric training sets. To this end, let $I \subseteq M \times M$ be a symmetric index set, in other words if $(i, j)$ belongs to $I$ then $(j, i)$ also belongs to $I$. Furthermore, we will make use of pairwise kernels $K$ with

$$K((a, b), (c, d)) = K((b, a), (d, c)) \quad \text{for all } a, b, c, d \in X. \tag{8}$$

As any pairwise kernel is assumed to be symmetric, (8) holds for any balanced pairwise kernel. Note that there are other pairwise kernels that satisfy (8), for instance the kernels given in Equations (1) and (2).

For $I_R, I_N \subseteq I$ defined by $I_R := \{(i, j) \in I \mid i = j\}$ and $I_N := I \setminus I_R$ let us consider the dual pairwise SVM

$$\begin{aligned}
\min_{\alpha} \quad & G(\alpha) \\
\text{s.t.} \quad & 0 \le \alpha_{ij} \le C \quad \text{for all } (i, j) \in I_N, \\
& 0 \le \alpha_{ii} \le 2C \quad \text{for all } (i, i) \in I_R, \\
& \sum_{(i,j) \in I} y_{ij} \alpha_{ij} = 0
\end{aligned} \tag{9}$$

with

$$G(\alpha) := \frac{1}{2} \sum_{(i,j),(k,l) \in I} \alpha_{ij} \alpha_{kl} y_{ij} y_{kl} K((x_i, x_j), (x_k, x_l)) - \sum_{(i,j) \in I} \alpha_{ij}.$$

Lemma 1 If $I$ is a symmetric index set and if (8) holds, then there is a solution $\hat{\alpha}$ of (9) with $\hat{\alpha}_{ij} = \hat{\alpha}_{ji}$ for all $(i, j) \in I$.

Proof By the theorem of Weierstrass there is a solution $\alpha^*$ of (9). Let us define another feasible point $\tilde{\alpha}$ of (9) by

$$\tilde{\alpha}_{ij} := \alpha^*_{ji} \quad \text{for all } (i, j) \in I.$$

For easier notation we set $K_{ij,kl} := K((x_i, x_j), (x_k, x_l))$. Then,

$$2G(\tilde{\alpha}) = \sum_{(i,j),(k,l) \in I} \alpha^*_{ji} \alpha^*_{lk} y_{ij} y_{kl} K_{ij,kl} - 2 \sum_{(i,j) \in I} \alpha^*_{ji}.$$

Note that $y_{ij} = y_{ji}$ holds for all $(i, j) \in I$. By (8) we further obtain

$$2G(\tilde{\alpha}) = \sum_{(i,j),(k,l) \in I} \alpha^*_{ji} \alpha^*_{lk} y_{ji} y_{lk} K_{ji,lk} - 2 \sum_{(i,j) \in I} \alpha^*_{ji} = 2G(\alpha^*).$$

The last equality holds since $I$ is a symmetric index set. Hence, $\tilde{\alpha}$ is also a solution of (9). Since (9) is convex (cf. Schölkopf and Smola, 2001),

$$\alpha^{\lambda} := \lambda \alpha^* + (1 - \lambda) \tilde{\alpha}$$

solves (9) for any $\lambda \in [0, 1]$. Thus, $\hat{\alpha} := \alpha^{1/2}$ has the desired property.

Note that a result similar to Lemma 1 is presented by Wei et al. (2006) for Support Vector Regression. They, however, claim that any solution of the corresponding quadratic program has the described property.

Theorem 2 If $I$ is a symmetric index set and if (8) holds, then any solution $\alpha$ of the optimization problem (9) leads to a symmetric pairwise decision function $f : X \times X \to \mathbb{R}$.

Proof For any solution $\alpha$ of (9) let us define $g_{\alpha} : X \times X \to \mathbb{R}$ by

$$g_{\alpha}(a, b) := \sum_{(i,j) \in I} \alpha_{ij} y_{ij} K((x_i, x_j), (a, b)).$$

Then, the obtained decision function can be written as $f_{\alpha}(a, b) = g_{\alpha}(a, b) + \gamma$ for some appropriate $\gamma \in \mathbb{R}$. If $\alpha^1$ and $\alpha^2$ are solutions of (9) then $g_{\alpha^1} = g_{\alpha^2}$ can be derived by means of convex optimization theory. According to Lemma 1 there is always a solution $\hat{\alpha}$ of (9) with $\hat{\alpha}_{ij} = \hat{\alpha}_{ji}$ for all $(i, j) \in I$. Obviously, such a solution leads to a symmetric decision function $f_{\hat{\alpha}}$. Hence, $f_{\alpha}$ is a symmetric decision function for all solutions $\alpha$.

3.2 Balanced Kernels vs. Symmetric Training Sets

Section 2 shows that one can use balanced kernels to obtain a symmetric pairwise decision function by means of a pairwise SVM. As detailed in Section 3.1 this can also be achieved by symmetric training sets. Now, we show in Theorem 3 that the decision function is the same, regardless of whether a symmetric training set or a certain balanced kernel is used. This result is also of practical value, since the approach with balanced kernels leads to significantly shorter training times (see the empirical results in Section 4.2).

Suppose $J$ is a largest subset of a given symmetric index set $I$ satisfying

$$((i, j) \in J \wedge j \ne i) \Rightarrow (j, i) \notin J.$$

Now, we consider the optimization problem

$$\begin{aligned}
\min_{\beta} \quad & H(\beta) \\
\text{s.t.} \quad & 0 \le \beta_{ij} \le 2C \quad \text{for all } (i, j) \in J, \\
& \sum_{(i,j) \in J} y_{ij} \beta_{ij} = 0
\end{aligned} \tag{10}$$

with

$$H(\beta) := \frac{1}{2} \sum_{(i,j),(k,l) \in J} \beta_{ij} \beta_{kl} y_{ij} y_{kl} \hat{K}_{ij,kl} - \sum_{(i,j) \in J} \beta_{ij}$$

and

$$\hat{K}_{ij,kl} := \tfrac{1}{2} \left( K_{ij,kl} + K_{ji,kl} \right), \tag{11}$$

where $K$ is an arbitrary pairwise kernel. Obviously, $\hat{K}$ is a balanced kernel. For instance, if $K = K_D$ (1) then $\hat{K} = K_{DL}$ (4) or if $K = K_T$ (2) then $\hat{K} = K_{TL}$ (5). The assumed symmetry of $K$ yields

$$\hat{K}_{ij,kl} = \hat{K}_{ij,lk} = \hat{K}_{ji,kl} = \hat{K}_{ji,lk} = \hat{K}_{kl,ij} = \hat{K}_{lk,ij} = \hat{K}_{kl,ji} = \hat{K}_{lk,ji}. \tag{12}$$

Note that (12) holds not only for kernels given by (11) but for any balanced kernel.

Theorem 3 Let the functions $g_{\alpha} : X \times X \to \mathbb{R}$ and $h_{\beta} : X \times X \to \mathbb{R}$ be defined by

$$g_{\alpha}(a, b) := \sum_{(i,j) \in I} \alpha_{ij} y_{ij} K((x_i, x_j), (a, b)),$$
$$h_{\beta}(a, b) := \sum_{(i,j) \in J} \beta_{ij} y_{ij} \hat{K}((x_i, x_j), (a, b)),$$

where $I$ is a symmetric index set and $J$ is defined as above. Additionally, let $K$ fulfill (8) and $\hat{K}$ be given by (11). Then, for any solution $\alpha^*$ of (9) and for any solution $\beta^*$ of (10) it holds that $g_{\alpha^*} = h_{\beta^*}$.

Proof By means of convex optimization theory it can be derived that $g_{\alpha}$ is the same function for any solution $\alpha$. The same holds for $h_{\beta}$ and any solution $\beta$. Hence, due to Lemma 1 we can assume that $\alpha^*$ is a solution of (9) with $\alpha^*_{ij} = \alpha^*_{ji}$. For $J_R := I_R$ and $J_N := J \setminus J_R$ we define $\bar{\beta}$ by

$$\bar{\beta}_{ij} := \begin{cases} \alpha^*_{ij} + \alpha^*_{ji} & \text{if } (i, j) \in J_N, \\ \alpha^*_{ii} & \text{if } (i, j) \in J_R. \end{cases}$$

Obviously, $\bar{\beta}$ is a feasible point of (10). Then, by (11) and $\alpha^*_{ij} = \alpha^*_{ji}$ we obtain for

$$\begin{aligned}
(i, j) \in J_N: \quad & \bar{\beta}_{ij} \hat{K}_{ij,kl} = \frac{\bar{\beta}_{ij}}{2} \left( K_{ij,kl} + K_{ji,kl} \right) = \frac{\alpha^*_{ij} + \alpha^*_{ji}}{2} \left( K_{ij,kl} + K_{ji,kl} \right) = \alpha^*_{ij} K_{ij,kl} + \alpha^*_{ji} K_{ji,kl}, \\
(i, i) \in J_R: \quad & \bar{\beta}_{ii} \hat{K}_{ii,kl} = \frac{\bar{\beta}_{ii}}{2} \left( K_{ii,kl} + K_{ii,kl} \right) = \alpha^*_{ii} K_{ii,kl}.
\end{aligned} \tag{13}$$

Then, $y_{ij} = y_{ji}$ implies

$$h_{\bar{\beta}} = g_{\alpha^*}. \tag{14}$$

In a second step we prove that $\bar{\beta}$ is a solution of problem (10). By using $y_{kl} = y_{lk}$, the symmetry of $K$, (13), (12), and the definition of $\bar{\beta}$ one obtains

$$\begin{aligned}
2G(\alpha^*) + 2 \sum_{(i,j) \in I} \alpha^*_{ij}
&= \sum_{(i,j) \in I} \alpha^*_{ij} y_{ij} \left( \sum_{(k,l) \in J_N} y_{kl} \left( \alpha^*_{kl} K_{ij,kl} + \alpha^*_{lk} K_{ij,lk} \right) + \sum_{(k,k) \in J_R} y_{kk} \alpha^*_{kk} K_{ij,kk} \right) \\
&= \sum_{(i,j) \in J_N \cup J_R} \alpha^*_{ij} y_{ij} \sum_{(k,l) \in J} \bar{\beta}_{kl} y_{kl} \hat{K}_{ij,kl} + \sum_{(i,j) \in J_N} \alpha^*_{ji} y_{ji} \sum_{(k,l) \in J} \bar{\beta}_{kl} y_{kl} \hat{K}_{ji,kl} \\
&= \sum_{(i,j) \in J_N} \bar{\beta}_{ij} y_{ij} \sum_{(k,l) \in J} \bar{\beta}_{kl} y_{kl} \hat{K}_{ij,kl} + \sum_{(i,i) \in J_R} \bar{\beta}_{ii} y_{ii} \sum_{(k,l) \in J} \bar{\beta}_{kl} y_{kl} \hat{K}_{ii,kl} \\
&= 2H(\bar{\beta}) + 2 \sum_{(i,j) \in J} \bar{\beta}_{ij}.
\end{aligned}$$

Then, the definition of $\bar{\beta}$ implies

$$G(\alpha^*) = H(\bar{\beta}). \tag{15}$$

Now, let us define $\bar{\alpha}$ by

$$\bar{\alpha}_{ij} := \begin{cases} \beta^*_{ij} / 2 & \text{if } (i, j) \in J_N, \\ \beta^*_{ji} / 2 & \text{if } (j, i) \in J_N, \\ \beta^*_{ii} & \text{if } (i, j) \in J_R. \end{cases}$$

Obviously, $\bar{\alpha}$ is a feasible point of (9). Then, by (8) and (11) we obtain for

$$\begin{aligned}
(k, l) \in J_N: \quad & \bar{\alpha}_{kl} K_{ij,kl} + \bar{\alpha}_{lk} K_{ij,lk} = \frac{\beta^*_{kl}}{2} \left( K_{ij,kl} + K_{ij,lk} \right) = \beta^*_{kl} \hat{K}_{ij,kl}, \\
(k, k) \in J_R: \quad & \bar{\alpha}_{kk} K_{ij,kk} = \frac{\beta^*_{kk}}{2} \left( K_{ij,kk} + K_{ij,kk} \right) = \beta^*_{kk} \hat{K}_{ij,kk}.
\end{aligned}$$

This, (12), and $y_{kl} = y_{lk}$ yield

$$\begin{aligned}
2H(\beta^*) + 2 \sum_{(i,j) \in J} \beta^*_{ij}
&= \sum_{(i,j) \in J} \beta^*_{ij} y_{ij} \left( \sum_{(k,l) \in J_N} \frac{1}{2} \beta^*_{kl} y_{kl} \left( \hat{K}_{ij,kl} + \hat{K}_{ji,kl} \right) + \sum_{(k,k) \in J_R} \frac{1}{2} \beta^*_{kk} y_{kk} \left( \hat{K}_{ij,kk} + \hat{K}_{ji,kk} \right) \right) \\
&= \frac{1}{2} \sum_{(i,j) \in J} \beta^*_{ij} y_{ij} \left( \sum_{(k,l) \in I} \bar{\alpha}_{kl} y_{kl} \left( K_{ij,kl} + K_{ji,kl} \right) \right).
\end{aligned}$$

Then, the definition of $\bar{\alpha}$ provides $\beta^*_{ij} = \bar{\alpha}_{ij} + \bar{\alpha}_{ji}$ for $(i, j) \in J_N$ and $\bar{\alpha}_{ij} = \bar{\alpha}_{ji}$. Thus,

$$2H(\beta^*) + 2 \sum_{(i,j) \in J} \beta^*_{ij} = \sum_{(i,j) \in I} \bar{\alpha}_{ij} y_{ij} \left( \sum_{(k,l) \in I} \bar{\alpha}_{kl} y_{kl} K_{ij,kl} \right) = 2G(\bar{\alpha}) + 2 \sum_{(i,j) \in I} \bar{\alpha}_{ij}$$

follows. This implies $G(\bar{\alpha}) = H(\beta^*)$. Now, let us assume that $\bar{\beta}$ is not a solution of (10). Then, $H(\beta^*) < H(\bar{\beta})$ holds and, by (15), we have

$$G(\alpha^*) = H(\bar{\beta}) > H(\beta^*) = G(\bar{\alpha}).$$

This is a contradiction to the optimality of $\alpha^*$. Hence, $\bar{\beta}$ is a solution of (10) and $h_{\beta^*} = h_{\bar{\beta}}$ follows. Then, with (14) we have the desired result.

4. Implementation

One of the most widely used techniques for solving SVMs efficiently is the sequential minimal optimization (SMO) (Platt, 1999). A well known implementation of this technique is LIBSVM (Chang and Lin, 2011). Empirically, SMO scales quadratically with the number of training points (Platt, 1999). Note that in pairwise classification the training points are the training pairs. If all possible training pairs are used, then the number of training pairs grows quadratically with the number $m$ of training examples. Hence, the runtime of LIBSVM would scale quartically with $m$. In Section 4.1 we discuss how the costs for evaluating pairwise kernels, which can be expressed by standard kernels, can be drastically reduced. In Section 3 we discussed that one can either use balanced kernels or symmetric training sets to enforce the symmetry of a pairwise decision function.
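The balanced-kernel property from Section 2 and the symmetrization (11) can be checked numerically on a small example. The following Python sketch is only an illustration under stated assumptions: the function names (`k_tensor`, `k_tensor_learning`, `symmetrize`, `is_balanced`, `sample_points`) are hypothetical and not taken from the paper or from LIBSVM, the standard kernel is chosen to be linear, and the test points are small integer vectors so that all comparisons are exact.

```python
import itertools
import random


def linear_kernel(s, t):
    """Standard kernel k(s, t) = <s, t> (linear case, chosen for illustration)."""
    return sum(si * ti for si, ti in zip(s, t))


def k_tensor(k, a, b, c, d):
    """Tensor pairwise kernel K_T((a, b), (c, d)) of Eq. (2)."""
    return k(a, c) * k(b, d)


def k_tensor_learning(k, a, b, c, d):
    """Tensor learning pairwise kernel K_TL((a, b), (c, d)) of Eq. (5)."""
    return 0.5 * (k(a, c) * k(b, d) + k(a, d) * k(b, c))


def symmetrize(pairwise_kernel):
    """Build K-hat of Eq. (11): average over both orders of the first pair."""
    def k_hat(k, a, b, c, d):
        return 0.5 * (pairwise_kernel(k, a, b, c, d)
                      + pairwise_kernel(k, b, a, c, d))
    return k_hat


def is_balanced(pairwise_kernel, k, points):
    """Check K((a, b), (c, d)) == K((a, b), (d, c)) on all 4-tuples of points."""
    return all(
        pairwise_kernel(k, a, b, c, d) == pairwise_kernel(k, a, b, d, c)
        for a, b, c, d in itertools.product(points, repeat=4)
    )


def sample_points(count=4, dim=3, seed=0):
    """Hypothetical test data: small integer vectors keep comparisons exact."""
    rng = random.Random(seed)
    return [tuple(rng.randint(-3, 3) for _ in range(dim)) for _ in range(count)]


if __name__ == "__main__":
    pts = sample_points()
    k_hat = symmetrize(k_tensor)
    print("K_TL balanced:", is_balanced(k_tensor_learning, linear_kernel, pts))
    print("K-hat of K_T balanced:", is_balanced(k_hat, linear_kernel, pts))
```

On the unit vectors $(1, 0)$ and $(0, 1)$ the tensor kernel $K_T$ fails the balance check, whereas its symmetrization agrees with $K_{TL}$ on every 4-tuple, which mirrors the statement that $K = K_T$ gives $\hat{K} = K_{TL}$.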
Additionally, we showed that both approaches lead to the same decision function. Section 4.2 compares the needed training times of the approach with balanced kernels and the approach with symmetric training sets.

4.1 Caching the Standard Kernel

In this subsection balanced kernels are used to enforce the symmetry of the pairwise decision function. Kernel evaluations are crucial for the performance of LIBSVM. If we could cache the whole kernel matrix in RAM we would get a huge increase of speed. Today, this seems impossible for significantly more than 125,250 training pairs as storing the (symmetric) kernel matrix for this number of pairs in double precision needs approximately 59GB. Note that training sets with 500 training examples already result in 125,250 training pairs.

Now, we describe how the costs of kernel evaluations can be drastically reduced. For example, let us select the kernel $K_{TL}$ (5) with an arbitrary standard kernel. For a single evaluation of $K_{TL}$ the standard kernel has to be evaluated four times with vectors of $X$. Afterwards, four arithmetic operations are needed. It is easy to see that each standard kernel value is used for evaluating many different elements of the kernel matrix. In general, it is possible to cache the standard kernel values for all training examples. For example, to cache the standard kernel values for 10,000 examples one needs 400MB. Thus, each kernel evaluation of $K_{TL}$ costs four arithmetic operations only. This does not depend on the chosen standard kernel.

Table 1 compares the training times with and without caching the standard kernel values. For these measurements examples from the double interval task (cf. Section 5.1) are used where each class is represented by 5 examples, $K_{TL}$ is chosen as pairwise kernel with a linear standard kernel, a cache size of 100MB is selected for caching pairwise kernel values, and all possible pairs are used for training. In Table 1a the training set of each run consists of $m = 250$ examples of 50 classes with different dimensions $n$. Table 1b shows results for different numbers $m$ of examples of dimension $n = 500$. The speed up factor by the described caching technique is up to 100.

(a) Different dimensions n of examples

Dimension n    Standard kernel (time in mm:ss)
of examples    not cached    cached
200            2:08          0:07
400            4:31          0:07
600            6:24          0:07
800            9:41          0:08
1000           11:27         0:09

(b) Different numbers m of examples

Number m       Standard kernel (time in hh:mm)
of examples    not cached    cached
200            0:04          0:00
400            1:05          0:01
600            4:17          0:02
800            12:40         0:06
1000           28:43         0:13

Table 1: Training time with and without caching the standard kernel

4.2 Balanced Kernels vs. Symmetric Training Sets

Theorem 3 shows that pairwise SVMs which use symmetric training sets and pairwise SVMs with balanced kernels lead to the same decision function. For symmetric training sets the number of training pairs is nearly doubled compared to the number in the case of balanced kernels. Simultaneously, (11) shows that evaluating a balanced kernel is computationally more expensive compared to the corresponding non balanced kernel.

Table 2 compares the needed training time of both approaches. There, examples from the double interval task (cf. Section 5.1) of dimension $n = 500$ are used where each class is represented by 5 examples, $K_T$ and its balanced version $K_{TL}$ with linear standard kernels are chosen as pairwise kernel, a cache size of 100MB is selected for caching the pairwise kernel values, and all possible pairs are used for training. It turns out that the approach with balanced kernels is three to four times faster than using symmetric training sets. Of course, the technique of caching the standard kernel values as described in Section 4.1 is used within all measurements.

Number m       Symmetric training set    Balanced kernel
of examples    (t in hh:mm)              (t in hh:mm)
500            0:03                      0:01
1000           0:46                      0:17
1500           3:26                      0:56
2000           9:44                      2:58
2500           23:15                     6:20

Table 2: Training time for symmetric training sets and for balanced kernels

5. Classification Experiments

In this section we will present results of applying pairwise SVMs to one synthetic data set and to one real world data set. Before we come to those data sets in Sections 5.1 and 5.2 we introduce $K_{TL}^{lin}$ and $K_{TL}^{poly}$. Those kernels denote $K_{TL}$ (5) with linear standard kernel and homogeneous polynomial standard kernel of degree two, respectively. The kernels $K_{ML}^{lin}$, $K_{ML}^{poly}$, $K_{TM}^{lin}$, and $K_{TM}^{poly}$ are defined analogously. In the following, detection error trade-off curves (DET curves, cf. Gamassi et al., 2004) will be used to measure the performance of a pairwise classifier. Such a curve shows for any false match rate (FMR) the corresponding false non match rate (FNMR). A special point of interest of such a curve is the (approximated) equal error rate (EER), that is the value for which FMR = FNMR holds.

5.1 Double Interval Task

Let us describe the double interval task of dimension $n$.
To get such an example $x \in \{-1, 1\}^n$ one draws $i, j, k, l \in \mathbb{N}$ so that $2 \le i \le j$, $j + 2 \le k \le l \le n$ and defines

$$x_p := \begin{cases} 1 & p \in \{i, \ldots, j\} \cup \{k, \ldots, l\}, \\ -1 & \text{otherwise.} \end{cases}$$

The class $c$ of such an example is given by $c(x) := (i, k)$. Note that the pair $(j, l)$ does not influence the class. Hence, there are $(n - 3)(n - 2)/2$ classes.

For our measurements we selected $n = 500$ and tested all kernels in (4)–(7) with a linear standard kernel and a homogeneous polynomial standard kernel of degree two, respectively. We created a test set consisting of 750 examples of 50 classes so that each class is represented by 15 examples. Any training set was generated in such a way that the set of classes in the training set is disjoint from the set of classes in the test set.

Figure 1: DET curves (FNMR over FMR) for double interval task. (a) Different class numbers in training ($K_{ML}^{lin}$ and $K_{TM}^{poly}$ with 50, 100, and 200 classes). (b) Different kernels for 200 classes in training ($K_{ML}^{lin}$, $K_{ML}^{poly}$, $K_{TL}^{lin}$, $K_{TL}^{poly}$, $K_{TM}^{lin}$, $K_{TM}^{poly}$).

We created training sets consisting of 50 classes and different numbers of examples per class. For training all possible training pairs were used. We observed that an increasing number of examples per class improves the performance independently of the other parameters. As a trade-off between the needed training time and performance of the classifier, we decided to use 15 examples per class for the measurements. Independently of the selected kernel, a penalty parameter $C$ of 1,000 turned out to be a good choice.
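The construction of a double interval example can be sketched in a few lines of Python. This is a minimal illustration, not the generator used for the reported measurements: the function names are hypothetical, and the uniform draws of $j$ and $l$ are an assumption, since the text only fixes the admissible ranges and not the sampling distribution.

```python
import random


def num_classes(n):
    """Number of classes (i, k) with 2 <= i and i + 2 <= k <= n."""
    return (n - 3) * (n - 2) // 2


def all_classes(n):
    """Enumerate every admissible class label (i, k) for dimension n."""
    return [(i, k) for i in range(2, n - 1) for k in range(i + 2, n + 1)]


def sample_example(n, label, rng):
    """Draw one example x in {-1, 1}^n of class label = (i, k).

    Assumption: j and l are drawn uniformly from their admissible ranges;
    the sampling distribution is not specified in the text.
    """
    i, k = label
    j = rng.randint(i, k - 2)  # ensures 2 <= i <= j and j + 2 <= k
    l = rng.randint(k, n)      # ensures k <= l <= n
    # Positions p are 1-based in the text, so x[p - 1] stores x_p.
    return [1 if (i <= p <= j or k <= p <= l) else -1 for p in range(1, n + 1)]


def class_of(x):
    """Recover c(x) = (i, k): the start positions of the two runs of +1."""
    i = x.index(1) + 1        # the first +1 starts the first interval
    p = i
    while x[p - 1] == 1:      # walk to the first -1 after the first run
        p += 1
    k = x.index(1, p) + 1     # the next +1 starts the second interval
    return (i, k)


if __name__ == "__main__":
    rng = random.Random(0)
    x = sample_example(12, (3, 8), rng)
    print(x, class_of(x))
```

Because position $k - 1$ is never part of either interval, the two runs of $+1$ are always separated, so `class_of` can read the label $(i, k)$ directly off the example regardless of the drawn $j$ and $l$.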
The kernel $K_{DL}$ (4) led to a bad performance regardless of the standard kernel chosen. Therefore, we omit results for $K_{DL}$.

Figure 1a shows that an increasing number of classes in the training set improves the performance significantly. This holds for all kernels mentioned above. Here, we only present results for $K_{ML}^{lin}$ and $K_{TM}^{poly}$. Figure 1b shows the DET curves for different kernels where the training set consists of 200 classes. In particular, any of the pairwise kernels which uses a homogeneous polynomial of degree 2 as standard kernel leads to better results than its corresponding counterpart with a linear standard kernel. For FMRs smaller than 0.07 $K_{TM}^{poly}$ leads to the best results, whereas for larger FMRs the DET curves of $K_{ML}^{poly}$, $K_{TL}^{poly}$, and $K_{TM}^{poly}$ intersect.

5.2 Labeled Faces in the Wild

In this subsection we will present results of applying pairwise SVMs to the labeled faces in the wild (LFW) data set (Huang et al., 2007). This data set consists of 13,233 images of 5,749 persons. Several remarks on this data set are in order. Huang et al. (2007) suggest two protocols for performance measurements. Here, the unrestricted protocol is used. This protocol is a fixed tenfold cross validation where each test set consists of 300 positive pairs and 300 negative pairs. Moreover, any person (class) in a training set is not part of the corresponding test set.

There are several feature vectors available for the LFW data set. For the presented measurements we mainly followed Li et al. (2012) and used the scale-invariant feature transform (SIFT)-based feature vectors for the funneled version (Guillaumin et al., 2009) of LFW. In addition, the aligned images (Wolf et al., 2009) are used. For this, the aligned images are cropped to 80 × 150 pixels and are then normalized by passing them through a log function (cf. Li et al., 2012). Afterwards, the local binary patterns (LBP) (Ojala et al., 2002) and three-patch LBP (TPLBP) (Wolf et al., 2008) are extracted. In contrast to Li et al. (2012), the pose is neither estimated nor swapped and no PCA is applied to the data. As the norm of the LBP feature vectors is not the same for all images we scaled them to Euclidean norm 1.

Figure 2: DET curves (FNMR over FMR) for LFW data set. (a) View 1 partition, different kernels, added up decision function values of SIFT, LBP, and TPLBP feature vectors. (b) Unrestricted protocol, $K_{TM}^{poly}$, different feature vectors; "+" stands for adding up the corresponding decision function values.

For model selection, the View 1 partition of the LFW database is recommended (Huang et al., 2007). Using all possible pairs of this partition for training and for testing, we obtained that a penalty parameter $C$ of 1,000 is suitable. Moreover, for each used feature vector, the kernel $K_{TM}^{poly}$ leads to the best results among all used kernels and also if sums of decision function values belonging to SIFT, LBP, and TPLBP feature vectors are used. For example, Figure 2a shows the performance of different kernels, where the decision function values corresponding to SIFT, LBP, and TPLBP feature vectors are added up. Due to the speed up techniques presented in Section 4 we were able to train with a large number of training pairs.
However, if all pairs were used for training, then any training set would consist of approximately 50,000,000 pairs and the training would still need too much time. Hence, whereas in any training set all positive training pairs were used, the negative training pairs were randomly selected in such a way that any training set consists of 2,000,000 pairs. The training of such a model took less than 24 hours on a standard PC. In Figure 2b we present the average DET curves obtained for $K_{TM}^{poly}$ and feature vectors based on SIFT, LBP, and TPLBP. Inspired by Li et al. (2012), we determined two further DET curves by adding up the decision function values. This led to very good results. Furthermore, we concatenated the SIFT, LBP, and TPLBP feature vectors. Surprisingly, the training of some of those models needed longer than a week. Therefore, we do not present these results.

In Table 3 the mean equal error rate (EER) and the standard error of the mean (SEM) obtained from the tenfold cross validation are provided for several types of feature vectors. Note that many of our results are comparable to the state of the art or even better. The current state of the art can be found on the homepage of Huang et al. (2007) and in the publication of Li et al. (2012). If only SIFT-based feature vectors are used, then the best known result is 0.125 ± 0.0040 (EER ± SEM). With pairwise SVMs we achieved the same EER but a slightly higher SEM, 0.1252 ± 0.0062. If we add up the decision function values corresponding to the LBP and TPLBP feature vectors, then our result 0.1210 ± 0.0046 is worse compared to the state of the art 0.1050 ± 0.0051. One possible reason for this fact might be that we did not swap the pose.
Finally, for the added up decision function values corresponding to SIFT, LBP, and TPLBP feature vectors, our performance 0.0947 ± 0.0057 is better than 0.0993 ± 0.0051. Furthermore, it is worth noting that our standard errors of the mean are comparable to those of the other presented learning algorithms although most of them use a PCA to reduce noise and dimension of the feature vectors. Note that the results of the commercial system are not directly comparable since it uses outside training data (for reference see Huang et al., 2007).

                              SIFT    LBP     TPLBP   L+T     S+L+T   CS
Pairwise SVM      Mean EER    0.1252  0.1497  0.1452  0.1210  0.0947  -
                  SEM         0.0062  0.0052  0.0060  0.0046  0.0057  -
State of the Art  Mean EER    0.1250  0.1267  0.1630  0.1050  0.0993  0.0870
                  SEM         0.0040  0.0055  0.0070  0.0051  0.0051  0.0030

Table 3: Mean EER and SEM for LFW data set. S=SIFT, L=LBP, T=TPLBP, +=adding up decision function values, CS=Commercial system face.com r2011b

6. Final Remarks

In this paper we suggested the SVM framework for handling large pairwise classification problems. We analyzed two approaches to enforce the symmetry of the obtained classifiers. To the best of our knowledge, we gave the first proof that symmetry is indeed achieved. Then, we proved that for each parameter set of one approach there is a corresponding parameter set of the other one such that both approaches lead to the same classifier. Additionally, we showed that the approach based on balanced kernels leads to shorter training times. We discussed details of the implementation of a pairwise SVM solver and presented numerical results. Those results demonstrate that pairwise SVMs are capable of successfully treating large scale pairwise classification problems. Furthermore, we showed that pairwise SVMs compete very well for a real world data set.
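The symmetry property summarized here can also be checked numerically. The Python sketch below builds a decision function from a symmetrized tensor-product pairwise kernel on top of a polynomial base kernel and evaluates it on a pair and on the swapped pair. The specific kernel form, the helper names, and the random data are illustrative assumptions and need not coincide with the kernels analyzed in this paper.

```python
import numpy as np


def base_kernel(x, y, degree=2):
    # Inhomogeneous polynomial kernel on single examples (illustrative choice).
    return (1.0 + float(np.dot(x, y))) ** degree


def symmetric_pair_kernel(pair1, pair2):
    # Assumed symmetrized tensor-product form:
    # K((a,b),(c,d)) = k(a,c) k(b,d) + k(a,d) k(b,c)
    a, b = pair1
    c, d = pair2
    return base_kernel(a, c) * base_kernel(b, d) + base_kernel(a, d) * base_kernel(b, c)


def decision_value(pair, support_pairs, coeffs, bias=0.0):
    # f(pair) = sum_i coeff_i * K(support_pair_i, pair) + bias
    total = bias
    for coeff, support_pair in zip(coeffs, support_pairs):
        total += coeff * symmetric_pair_kernel(support_pair, pair)
    return total


# Hypothetical data: three "support pairs" with arbitrary coefficients.
rng = np.random.default_rng(0)
support_pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(3)]
coeffs = [0.5, -1.0, 0.25]
a, b = rng.normal(size=4), rng.normal(size=4)

f_ab = decision_value((a, b), support_pairs, coeffs)
f_ba = decision_value((b, a), support_pairs, coeffs)
```

With a kernel of this form, swapping the two examples of a pair only exchanges the two summands of each kernel value, so `f_ab` and `f_ba` agree regardless of the coefficients.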
We would like to underline that some of the discussed techniques could be transferred to other approaches for solving pairwise classification problems. For example, most of the results can be applied easily to One Class Support Vector Machines (Schölkopf et al., 2001).

Acknowledgments

We would like to thank the unknown referees for their valuable comments and suggestions.

References

J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.

A. Bar-Hillel and D. Weinshall. Learning distance function by coding similarity. In Z. Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning (ICML '07), pages 65–72. ACM, 2007.

A. Bar-Hillel, T. Hertz, and D. Weinshall. Boosting margin based distance functions for clustering. In C. E. Brodley, editor, Proceedings of the 21st International Conference on Machine Learning (ICML '04), pages 393–400. ACM, 2004a.

A. Bar-Hillel, T. Hertz, and D. Weinshall. Learning distance functions for image retrieval. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), volume 2, pages 570–577. IEEE Computer Society Press, 2004b.

A. Ben-Hur and W. Stafford Noble. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21(1):38–46, 2005.

C. Brunner, A. Fischer, K. Luig, and T. Thies. Pairwise kernels, support vector machines, and the application to large scale problems. Technical Report MATH-NM-04-2011, Institute of Numerical Mathematics, Technische Universität Dresden, October 2011. URL http://www.math.tu-dresden.de/~fischer.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–26, 2011. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm (August 2011).

K. Duan and S. S. Keerthi. Which is the best multiclass SVM method? An empirical study. In N. C. Oza, R. Polikar, J. Kittler, and F. Roli, editors, Proceedings of the 6th International Workshop on Multiple Classifier Systems, pages 278–285. Springer, 2005.

M. Gamassi, M. Lazzaroni, M. Misino, V. Piuri, D. Sana, and F. Scotti. Accuracy and performance of biometric systems. In Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference (IMTC '04), pages 510–515. IEEE, 2004.

M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pages 498–505, 2009. URL http://lear.inrialpes.fr/pubs/2009/GVS09 (August 2011).

S. I. Hill and A. Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30(1):525–564, 2007.

C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007. URL http://vis-www.cs.umass.edu/lfw/ (August 2011).

P. Li, Y. Fu, U. Mohammed, J. H. Elder, and S. J. D. Prince. Probabilistic models for inference about identity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:144–157, 2012.

T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002. URL http://www.cse.oulu.fi/MVG/Downloads/LBPMatlab (August 2011).

P. J. Phillips. Support vector machines applied to face recognition. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 803–809. MIT Press, 1999.

J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185–208. MIT Press, 1999.

R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.

B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

J. P. Vert, J. Qiu, and W. Noble. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, 8(Suppl 10):S8, 2007.

L. Wei, Y. Yang, R. M. Nishikawa, and M. N. Wernick. Learning of perceptual similarity from expert readers for mammogram retrieval. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), pages 1356–1359. IEEE, 2006.

L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In Faces in Real-Life Images Workshop at the European Conference on Computer Vision (ECCV '08), 2008. URL http://www.openu.ac.il/home/hassner/projects/Patchlbp (August 2011).

L. Wolf, T. Hassner, and Y. Taigman. Similarity scores based on background samples. In Proceedings of the 9th Asian Conference on Computer Vision (ACCV '09), volume 2, pages 88–97, 2009.
