Structural Extension to Logistic Regression:
Discriminative Parameter Learning of Belief Net Classifiers

Russell Greiner* (University of Alberta)          Wei Zhou (University of Waterloo)
*greiner@cs.ualberta.ca

Belief Net
A belief net B = <V, A, Θ> consists of:
• Nodes V (the variables)
• Arcs A (the dependencies)
• Parameters Θ (the conditional probabilities, e.g. θ_{+z|X=x,W=w})
B ≡ a distribution P_B(c, e) ≡ a classifier h_B(e) = argmax_c P_B(c | e).
[figure: example net over nodes X, W, Y, Z, Q, with the CPtable for Z given X, W]

Learner's Task
Given:
• Structure G = <V, A> (nodes, arcs… NOT the parameters)
• Labeled data sample S = { <c_i, e_i> }
  [table: instances over attributes E1, E2, E3, …, En, each with class label C = + or −]
The learner's output is the classifier used by the performer.

If the goal is…
• Generative (learn the distribution): ideally, minimize
    KL(truth, B) = Σ_{c,e} truth(c,e) · ln[ truth(c,e) / P_B(c,e) ]
  … in practice, maximize the likelihood:
    B^(ML) = argmax_B (1/|S|) Σ_i ln P_B(c_i, e_i)
• Discriminative (learn a classifier): ideally, minimize
    err(B) = Σ_{c,e} truth(c,e) · I( c ≠ h_B(e) )
  i.e. B* = argmin_B err(B); empirically, argmin_B Σ_i I( c_i ≠ h_B(e_i) )
  … in practice, maximize the conditional likelihood:
    B^(MCL) = argmax_B (1/|S|) Σ_i ln P_B(c_i | e_i)

Our Specific Task
Now: assume a FIXED structure…
Given a structure G = <V, A> and a labeled data sample S, find the parameters Θ that maximize
    LĈL^(S)(Θ) = (1/|S|) Σ_{<c,e> ∈ S} log P_Θ(c | e)

Computational Complexity
• NP-hard to find the Θ values that optimize LĈL(Θ), for an arbitrary structure with incomplete data
• Proof: [reduction using a small net over X, W, Z and its CPtable θ_{±z|X=x,W=w}; see paper]

Sample Complexity
Let G have |Θ| = K parameters over |V| = N variables, and for γ > 0 let
    Ω_{G,γ} = { ParamFor(G) | θ_{d|f} ≥ γ },    Θ*_{G,γ} = argmax { LCL(Θ) | Θ ∈ Ω_{G,γ} }
For any ε, δ > 0, a sample S of size M_{γ,K,N}(ε,δ), polynomial in N, K, 1/ε, ln(1/δ) and ln γ
(on the order of N²K³ ln²γ / ε², up to log factors), suffices:
with probability at least 1−δ, LĈL(Θ̂*) is within ε of LCL(Θ*_{G,γ}).
• Similar bounds hold when dealing with err(), as with LCL()
• [Dasgupta, 1997] proves a comparable bound for (generative) likelihood learning
  – the same O(·) as our bound, ignoring ln²(·) and ln³(·) terms
• This dependence is unavoidable here… unlike the likelihood case [ATW91],
  where (given complete data) few tuples are sufficient wrt likelihood

ELR Learning Algorithm
Input:
• Structure G = <V, A>
• Labeled data sample S (possibly incomplete)
Output:
• Parameters Θ, i.e. Θ̂*(S) = argmax_Θ LĈL^(S)(Θ)

As NP-hard… hillclimb!
How to hillclimb? … climb along the θ_{d|f}'s!
Not just by changing the {θ_{d|f}} directly, as there are constraints:
 a. θ_{d|f} ≥ 0
 b. Σ_d θ_{d|f} = 1
So… use "softmax" terms:
    θ_{d|f} = e^{β_{d|f}} / Σ_{d'} e^{β_{d'|f}}
Need the derivative:
    ∂LĈL^(S)(Θ) / ∂β_{d|f} = Σ_{<c,e> ∈ S} [ P_Θ(d, f | e, c) − P_Θ(d, f | e) − θ_{d|f} ( P_Θ(f | e, c) − P_Θ(f | e) ) ]
Change each β_{d|f} to improve LĈL(Θ).

Optimizations:
• Initialize using OFE values (not random): "plug-in parameters"
• Line-search, conjugate gradient ([Minka, 2001] confirms these are effective for logistic regression)
• Derivative = 0 when D and F are d-separated from E and C… and so those terms can be ignored!
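To make the hillclimb concrete, here is a minimal sketch of one ELR update for the simplest setting: complete binary data on a naive Bayes structure, where every P_Θ(·|·) term in the derivative has a closed form. The function names, array shapes, and learning rate are our own illustration (the authors' implementation was in Java); for arbitrary structures or incomplete data, the probability terms would instead come from belief net inference.

```python
import numpy as np

# Toy setting: binary class C, binary attributes E1..Ek, naive Bayes
# structure, complete data. The parameters are the softmax "beta" terms;
# theta = softmax(beta) keeps every CPtable row nonnegative and normalized.

def softmax(b, axis=-1):
    z = np.exp(b - b.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def posterior(theta_c, theta_e, x):
    """P(C | e) under naive Bayes: P(c, e) = P(c) * prod_j P(e_j | c)."""
    joint = theta_c.copy()                       # shape (2,): one entry per class
    for j, v in enumerate(x):
        joint = joint * theta_e[j, :, v]         # theta_e[j, c, v] = P(E_j=v | C=c)
    return joint / joint.sum()

def elr_step(beta_c, beta_e, data, lr=0.1):
    """One hillclimbing (gradient-ascent) step on LCL in the beta terms."""
    theta_c = softmax(beta_c)
    theta_e = softmax(beta_e, axis=-1)
    g_c = np.zeros_like(beta_c)
    g_e = np.zeros_like(beta_e)
    for c, x in data:
        post = posterior(theta_c, theta_e, x)    # P(C=f | e) for each class f
        # Class node (no parents): the derivative reduces to [d = c] - P(C=d | e).
        g_c[c] += 1.0
        g_c -= post
        # Attribute node E_j with parent value f and observed value d = x[j]:
        # with complete data the general derivative collapses to
        #   ([e_j = d] - theta[j, f, d]) * ([c = f] - P(f | e)).
        for j, d in enumerate(x):
            for f in (0, 1):
                delta = (1.0 if c == f else 0.0) - post[f]   # P(f|e,c) - P(f|e)
                g_e[j, f, d] += delta
                g_e[j, f, :] -= theta_e[j, f, :] * delta
    # The 1/|S| matches the definition of LCL-hat; lr would come from line-search.
    n = len(data)
    return beta_c + lr * g_c / n, beta_e + lr * g_e / n
```

Because the softmax keeps constraints (a) and (b) satisfied automatically, each step is plain unconstrained gradient ascent on LĈL.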
Relation to Logistic Regression
• ELR on the naive Bayes structure ≡ standard logistic regression
• … but ELR also deals with arbitrary structures, and with incomplete data

Other Algorithms…
When given complete data… compare to OFE (Observed Frequency Estimate):
• Trivial algorithm… maximizes the likelihood
• Works only with COMPLETE data
• Worked example (5 records):
    E1 E2 Ek | C
     1  1  0 | 1
     0  1  1 | 1
     1  0  1 | 1
     0  0  0 | 0
     0  1  1 | 0
  3 "C=1"s, of which 2 are "E1=1, C=1"s
  So θ_{E1=1|C=1} = 2/3, θ_{E1=1|C=0} = …, and θ_{C=1} = 3/5

When given incomplete data…
• EM (Expectation Maximization)
• APN (Adaptive Probabilistic Networks) [BKRK97]: hillclimb in the (unconditional) likelihood
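For contrast, a sketch of OFE in the same toy setting, reproducing the worked example above; `ofe` is our hypothetical helper name, but the computation is exactly these observed frequencies:

```python
import numpy as np

def ofe(data, k):
    """Observed Frequency Estimate, naive Bayes structure, complete binary
    data: theta_e[j, c, v] = #(E_j = v, C = c) / #(C = c)."""
    n_c = np.zeros(2)                   # class counts
    n_e = np.zeros((k, 2, 2))           # (attribute, class, value) counts
    for c, x in data:
        n_c[c] += 1
        for j, v in enumerate(x):
            n_e[j, c, v] += 1
    theta_c = n_c / n_c.sum()           # P(C); assumes both classes observed
    theta_e = n_e / n_c[None, :, None]  # P(E_j = v | C = c)
    return theta_c, theta_e

# The worked example: 5 records, 3 with C=1, of which 2 have E1=1.
data = [(1, (1, 1, 0)), (1, (0, 1, 1)), (1, (1, 0, 1)),
        (0, (0, 0, 0)), (0, (0, 1, 1))]
theta_c, theta_e = ofe(data, k=3)
assert abs(theta_c[1] - 3/5) < 1e-12        # theta_{C=1}      = 3/5
assert abs(theta_e[0, 1, 1] - 2/3) < 1e-12  # theta_{E1=1|C=1} = 2/3
```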
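And the "plug-in" initialization from the Optimizations list (ELR-OFE in the experiments below): since every CPtable row of θ sums to 1, setting β = ln θ makes softmax(β) reproduce θ exactly, so the hillclimb starts from the OFE fit rather than from random weights. A sketch gluing together the two hypothetical helpers above; the iteration count, step size, and clipping floor are arbitrary choices of ours:

```python
import numpy as np

def elr_ofe(data, k, iters=200, lr=0.1, floor=1e-6):
    """ELR-OFE: initialize the softmax weights at the OFE estimates,
    then hillclimb on conditional likelihood with elr_step()."""
    theta_c, theta_e = ofe(data, k)
    beta_c = np.log(np.clip(theta_c, floor, 1.0))   # "plug-in" initialization
    beta_e = np.log(np.clip(theta_e, floor, 1.0))   # (clip guards against ln 0)
    for _ in range(iters):
        beta_c, beta_e = elr_step(beta_c, beta_e, data, lr)
    return softmax(beta_c), softmax(beta_e, axis=-1)
```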
Empirical Results

Experiments using the NaïveBayes and TAN structures:
• NaïveBayes structure: a link from the class node to each attribute
  – attributes independent, given the class
• TAN structure: a link from the class node to each attribute, plus a tree structure connecting the attributes
  – permits dependencies between attributes
  – efficient learning algorithm; efficient classification
  – works well in practice… [FGG'97]
• 25 datasets
  – 23 from UCI, continuous + discrete
  – 2 from the SelectiveNB study (used by [FGG'96])
• Complete data, so far: every attribute of every instance is specified

Complete data:
[scatter plots, all 25 domains: points below y = x mean NB+ELR better than NB+OFE; bars are 1 standard deviation]
• ELR better than OFE! (p < 0.005)
• TAN+ELR > TAN+OFE (p < 0.025)
• In general, NB+ELR ≈ TAN+OFE

Correctness of Structure
• NB does poorly on CORRAL
  – an artificial dataset, a function of 4 attributes
• A "generative" learner (OFE/APN/EM) is very constrained by the structure…
  – so if the structure is wrong, it cannot do well!
• A "discriminative" learner (ELR) is not as constrained!
  – TAN can deal with dependent attributes; NB cannot
  – … but ELR is designed to help classify; OFE is not
• TAN+ELR did perfectly on CORRAL!

Other Studies
Given data:
1. Use PowerConstructor [CG02, CG99] to build the structure
2. Use OFE vs ELR to find the parameters
For Chess: [Fig. 2b of the paper: TAN structure, Chess domain]

Correct structure, incomplete data
Consider the Alarm [BSCC89] structure (+ parameters): 36 nodes, 47 links, 505 parameters.
Multiple queries:
• 8 variables as the pool of query variables; 16 other variables as the pool of evidence variables
• Each query: 1 query variable; each evidence variable included w/prob ½… so expect 16/2 = 8 evidence variables
• NOTE: a different query variable for different queries! (like multi-task learning)
Results: [Fig. 6c of the paper]

Missing Data
• So far, each dataset was complete (it includes the value of every attribute in each instance)
• Now… some omissions: omit attribute values w/prob = 0.25
  – "Missing Completely at Random" (MCAR) [BKRK97]
  [table: the same sample with some entries replaced by * (omitted)]
• OFE no longer applies, so compare ELR to EM and APN
• Measured classification error (k = 5, 400 records, 25% MCAR omissions): TAN+ELR vs TAN+EM vs TAN+APN
• NB+ELR better than NB+EM, NB+APN (p < 0.025)
• (The TAN structure-learning algorithm is itself problematic here… as the data is incomplete)

Why does ELR work so well?
Study with increasingly "non-NB" data (compare NB+ELR to NB+OFE, complete data):
• Model #0 (true naive Bayes): P(C) = 0.9, P(Ei|C) = 0.2, P(Ei|~C) = 0.8
• Models #1, #2, #3, …: attributes progressively "joined" to E1, i.e. P(Ei|E1) = 1.0, P(Ei|~E1) = 0.0
• Measured classification error as the data departs from the NB assumption

Summary of Results
• OFE is guaranteed to find the parameters optimal wrt likelihood, for the structure G
  – if G is incorrect… optimal-for-G is bad wrt the true distribution: wrong answers to queries!
• ELR is not as constrained by G… it can do well even when the structure is incorrect!
• ELR is useful, as the structure is often incorrect: to avoid overfitting, learners use a constrained set of structures (NB, TAN, …)
• Complete data: TAN+ELR > TAN+OFE; NB+ELR > NB+OFE
• Incomplete data: NB+ELR > NB+EM, NB+APN
• ELR was relatively slow
  – 0.5 sec/iteration for small data, … minutes for large data
  – much slower than OFE
  – … but ELR used unoptimized Java code
• APN/EM: … the same algorithm for complete and incomplete data
(See also: Discriminative vs Generative learning…)

TradeOff
• Most BN-learners spend LOTS of time learning the structure, and little time learning the parameters
• Why not… use a SIMPLE (quick-to-learn) structure, and focus the computational effort on getting good parameters?
  – ELR-OFE: initialize the parameters using the OFE values, then run ELR
• Clearly a good idea… should be used for classification tasks!

Related Work
• Lots of work on learning BNs… most of it generative learning
• Some discriminative learners, but most… learn the STRUCTURE discriminatively, then the parameters generatively!
• See also Logistic Learning [GGS'97]: learns the parameters discriminatively, but… different queries, an L2-norm objective (not LCL), and it needed 2 types of data samples, …

Future Work
• Analysis: NP-hard to learn LCL-optimal parameters for an arbitrary structure with incomplete data…
  what is the complexity given complete data? … a simple structure? … a nearly correct structure?
• Learn the STRUCTURE discriminatively as well
• Understand further why ELR works so well, vs OFE (complete data) and vs EM/APN (incomplete data), for a fixed simple structure (NB, TAN)

Contributions
• Motivate/describe discriminative learning of BN parameters
• Complexity of the task (NP-hard; polynomial sample size)
• An algorithm for the task, ELR
  – complete or incomplete data
  – arbitrary structures
  – softmax version, optimizations, …
• Empirical results showing ELR works, + a study to show why…

This work was partially funded by NSERC and by Syncrude.