Structural Extension to Logistic Regression:
Discriminative Parameter Learning of Belief Net Classifiers

Russell Greiner* (University of Alberta)          Wei Zhou (University of Waterloo)
*greiner@cs.ualberta.ca

Belief Net
A belief net B = <V, A, Θ> consists of:
• Nodes V (the variables)
• Arcs A (the dependencies)
• Parameters Θ (the conditional probabilities, e.g. θ_{+z|X=x,W=w})
B ≡ a distribution P_B(c, e) ≡ a classifier h_B(e) = argmax_c P_B(c | e).
[figure: example net over nodes X, W, Y, Z, Q, with the CPtable for Z given X, W]

Learner's Task
Given:
• Structure G = <V, A> (nodes, arcs… NOT the parameters)
• Labeled data sample S = { <c_i, e_i> }
  [table: instances over attributes E1, E2, E3, …, En, each with class label C = + or −]
The learner's output is the classifier used by the performer.

If the goal is…
• Generative (learn the distribution): ideally, minimize
    KL(truth, B) = Σ_{c,e} truth(c,e) · ln[ truth(c,e) / P_B(c,e) ]
  … in practice, maximize the likelihood:
    B^(ML) = argmax_B (1/|S|) Σ_i ln P_B(c_i, e_i)
• Discriminative (learn a classifier): ideally, minimize
    err(B) = Σ_{c,e} truth(c,e) · I( c ≠ h_B(e) )
  i.e. B* = argmin_B err(B); empirically, argmin_B Σ_i I( c_i ≠ h_B(e_i) )
  … in practice, maximize the conditional likelihood:
    B^(MCL) = argmax_B (1/|S|) Σ_i ln P_B(c_i | e_i)

Our Specific Task
Now: assume a FIXED structure…
Given a structure G = <V, A> and a labeled data sample S, find the parameters Θ that maximize
    LĈL^(S)(Θ) = (1/|S|) Σ_{<c,e> ∈ S} log P_Θ(c | e)

Computational Complexity
• NP-hard to find the Θ values that optimize LĈL(Θ), for an arbitrary structure with incomplete data
• Proof: [reduction using a small net over X, W, Z and its CPtable θ_{±z|X=x,W=w}; see paper]

Sample Complexity
Let G have |Θ| = K parameters over |V| = N variables, and for γ > 0 let
    Ω_{G,γ} = { ParamFor(G) | θ_{d|f} ≥ γ },    Θ*_{G,γ} = argmax { LCL(Θ) | Θ ∈ Ω_{G,γ} }
For any ε, δ > 0, a sample S of size M_{γ,K,N}(ε,δ), polynomial in N, K, 1/ε, ln(1/δ) and ln γ
(on the order of N²K³ ln²γ / ε², up to log factors), suffices:
with probability at least 1−δ, LĈL(Θ̂*) is within ε of LCL(Θ*_{G,γ}).
• Similar bounds hold when dealing with err(), as with LCL()
• [Dasgupta, 1997] proves a comparable bound for (generative) likelihood learning
  – the same O(·) as our bound, ignoring ln²(·) and ln³(·) terms
• This dependence is unavoidable here… unlike the likelihood case [ATW91],
  where (given complete data) few tuples are sufficient wrt likelihood

ELR Learning Algorithm
Input:
• Structure G = <V, A>
• Labeled data sample S (possibly incomplete)
Output:
• Parameters Θ, i.e. Θ̂*(S) = argmax_Θ LĈL^(S)(Θ)

As NP-hard… hillclimb!
How to hillclimb? … climb along the θ_{d|f}'s!
Not just by changing the {θ_{d|f}} directly, as there are constraints:
 a. θ_{d|f} ≥ 0
 b. Σ_d θ_{d|f} = 1
So… use "softmax" terms:
    θ_{d|f} = e^{β_{d|f}} / Σ_{d'} e^{β_{d'|f}}
Need the derivative:
    ∂LĈL^(S)(Θ) / ∂β_{d|f} = Σ_{<c,e> ∈ S} [ P_Θ(d, f | e, c) − P_Θ(d, f | e) − θ_{d|f} ( P_Θ(f | e, c) − P_Θ(f | e) ) ]
Change each β_{d|f} to improve LĈL(Θ).

Optimizations:
• Initialize using OFE values (not random): "plug-in parameters"
• Line-search, conjugate gradient ([Minka, 2001] confirms these are effective for logistic regression)
• Derivative = 0 when D and F are d-separated from E and C… and so those terms can be ignored!
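To make the hillclimb concrete, here is a minimal sketch of one ELR update for the simplest setting: complete binary data on a naive Bayes structure, where every P_Θ(·|·) term in the derivative has a closed form. The function names, array shapes, and learning rate are our own illustration (the authors' implementation was in Java); for arbitrary structures or incomplete data, the probability terms would instead come from belief net inference.

```python
import numpy as np

# Toy setting: binary class C, binary attributes E1..Ek, naive Bayes
# structure, complete data. The parameters are the softmax "beta" terms;
# theta = softmax(beta) keeps every CPtable row nonnegative and normalized.

def softmax(b, axis=-1):
    z = np.exp(b - b.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def posterior(theta_c, theta_e, x):
    """P(C | e) under naive Bayes: P(c, e) = P(c) * prod_j P(e_j | c)."""
    joint = theta_c.copy()                       # shape (2,): one entry per class
    for j, v in enumerate(x):
        joint = joint * theta_e[j, :, v]         # theta_e[j, c, v] = P(E_j=v | C=c)
    return joint / joint.sum()

def elr_step(beta_c, beta_e, data, lr=0.1):
    """One hillclimbing (gradient-ascent) step on LCL in the beta terms."""
    theta_c = softmax(beta_c)
    theta_e = softmax(beta_e, axis=-1)
    g_c = np.zeros_like(beta_c)
    g_e = np.zeros_like(beta_e)
    for c, x in data:
        post = posterior(theta_c, theta_e, x)    # P(C=f | e) for each class f
        # Class node (no parents): the derivative reduces to [d = c] - P(C=d | e).
        g_c[c] += 1.0
        g_c -= post
        # Attribute node E_j with parent value f and observed value d = x[j]:
        # with complete data the general derivative collapses to
        #   ([e_j = d] - theta[j, f, d]) * ([c = f] - P(f | e)).
        for j, d in enumerate(x):
            for f in (0, 1):
                delta = (1.0 if c == f else 0.0) - post[f]   # P(f|e,c) - P(f|e)
                g_e[j, f, d] += delta
                g_e[j, f, :] -= theta_e[j, f, :] * delta
    # The 1/|S| matches the definition of LCL-hat; lr would come from line-search.
    n = len(data)
    return beta_c + lr * g_c / n, beta_e + lr * g_e / n
```

Because the softmax keeps constraints (a) and (b) satisfied automatically, each step is plain unconstrained gradient ascent on LĈL.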
Relation to Logistic Regression
• ELR on the naive Bayes structure ≡ standard logistic regression
• … but ELR also deals with arbitrary structures, and with incomplete data

Other Algorithms…
When given complete data… compare to OFE (Observed Frequency Estimate):
• Trivial algorithm… maximizes the likelihood
• Works only with COMPLETE data
• Worked example (5 records):
    E1 E2 Ek | C
     1  1  0 | 1
     0  1  1 | 1
     1  0  1 | 1
     0  0  0 | 0
     0  1  1 | 0
  3 "C=1"s, of which 2 are "E1=1, C=1"s
  So θ_{E1=1|C=1} = 2/3, θ_{E1=1|C=0} = …, and θ_{C=1} = 3/5

When given incomplete data…
• EM (Expectation Maximization)
• APN (Adaptive Probabilistic Networks) [BKRK97]: hillclimb in the (unconditional) likelihood
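For contrast, a sketch of OFE in the same toy setting, reproducing the worked example above; `ofe` is our hypothetical helper name, but the computation is exactly these observed frequencies:

```python
import numpy as np

def ofe(data, k):
    """Observed Frequency Estimate, naive Bayes structure, complete binary
    data: theta_e[j, c, v] = #(E_j = v, C = c) / #(C = c)."""
    n_c = np.zeros(2)                   # class counts
    n_e = np.zeros((k, 2, 2))           # (attribute, class, value) counts
    for c, x in data:
        n_c[c] += 1
        for j, v in enumerate(x):
            n_e[j, c, v] += 1
    theta_c = n_c / n_c.sum()           # P(C); assumes both classes observed
    theta_e = n_e / n_c[None, :, None]  # P(E_j = v | C = c)
    return theta_c, theta_e

# The worked example: 5 records, 3 with C=1, of which 2 have E1=1.
data = [(1, (1, 1, 0)), (1, (0, 1, 1)), (1, (1, 0, 1)),
        (0, (0, 0, 0)), (0, (0, 1, 1))]
theta_c, theta_e = ofe(data, k=3)
assert abs(theta_c[1] - 3/5) < 1e-12        # theta_{C=1}      = 3/5
assert abs(theta_e[0, 1, 1] - 2/3) < 1e-12  # theta_{E1=1|C=1} = 2/3
```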
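And the "plug-in" initialization from the Optimizations list (ELR-OFE in the experiments below): since every CPtable row of θ sums to 1, setting β = ln θ makes softmax(β) reproduce θ exactly, so the hillclimb starts from the OFE fit rather than from random weights. A sketch gluing together the two hypothetical helpers above; the iteration count, step size, and clipping floor are arbitrary choices of ours:

```python
import numpy as np

def elr_ofe(data, k, iters=200, lr=0.1, floor=1e-6):
    """ELR-OFE: initialize the softmax weights at the OFE estimates,
    then hillclimb on conditional likelihood with elr_step()."""
    theta_c, theta_e = ofe(data, k)
    beta_c = np.log(np.clip(theta_c, floor, 1.0))   # "plug-in" initialization
    beta_e = np.log(np.clip(theta_e, floor, 1.0))   # (clip guards against ln 0)
    for _ in range(iters):
        beta_c, beta_e = elr_step(beta_c, beta_e, data, lr)
    return softmax(beta_c), softmax(beta_e, axis=-1)
```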
Empirical Results

Experiments using the NaïveBayes and TAN structures:
• NaïveBayes structure: a link from the class node to each attribute
  – attributes independent, given the class
• TAN structure: a link from the class node to each attribute, plus a tree structure connecting the attributes
  – permits dependencies between attributes
  – efficient learning algorithm; efficient classification
  – works well in practice… [FGG'97]
• 25 datasets
  – 23 from UCI, continuous + discrete
  – 2 from the SelectiveNB study (used by [FGG'96])
• Complete data, so far: every attribute of every instance is specified

Complete data:
[scatter plots, all 25 domains: points below y = x mean NB+ELR better than NB+OFE; bars are 1 standard deviation]
• ELR better than OFE! (p < 0.005)
• TAN+ELR > TAN+OFE (p < 0.025)
• In general, NB+ELR ≈ TAN+OFE

Correctness of Structure
• NB does poorly on CORRAL
  – an artificial dataset, a function of 4 attributes
• A "generative" learner (OFE/APN/EM) is very constrained by the structure…
  – so if the structure is wrong, it cannot do well!
• A "discriminative" learner (ELR) is not as constrained!
  – TAN can deal with dependent attributes; NB cannot
  – … but ELR is designed to help classify; OFE is not
• TAN+ELR did perfectly on CORRAL!

Other Studies
Given data:
1. Use PowerConstructor [CG02, CG99] to build the structure
2. Use OFE vs ELR to find the parameters
For Chess: [Fig. 2b of the paper: TAN structure, Chess domain]

Correct structure, incomplete data
Consider the Alarm [BSCC89] structure (+ parameters): 36 nodes, 47 links, 505 parameters.
Multiple queries:
• 8 variables as the pool of query variables; 16 other variables as the pool of evidence variables
• Each query: 1 query variable; each evidence variable included w/prob ½… so expect 16/2 = 8 evidence variables
• NOTE: a different query variable for different queries! (like multi-task learning)
Results: [Fig. 6c of the paper]

Missing Data
• So far, each dataset was complete (it includes the value of every attribute in each instance)
• Now… some omissions: omit attribute values w/prob = 0.25
  – "Missing Completely at Random" (MCAR) [BKRK97]
  [table: the same sample with some entries replaced by * (omitted)]
• OFE no longer applies, so compare ELR to EM and APN
• Measured classification error (k = 5, 400 records, 25% MCAR omissions): TAN+ELR vs TAN+EM vs TAN+APN
• NB+ELR better than NB+EM, NB+APN (p < 0.025)
• (The TAN structure-learning algorithm is itself problematic here… as the data is incomplete)

Why does ELR work so well?
Study with increasingly "non-NB" data (compare NB+ELR to NB+OFE, complete data):
• Model #0 (true naive Bayes): P(C) = 0.9, P(Ei|C) = 0.2, P(Ei|~C) = 0.8
• Models #1, #2, #3, …: attributes progressively "joined" to E1, i.e. P(Ei|E1) = 1.0, P(Ei|~E1) = 0.0
• Measured classification error as the data departs from the NB assumption

Summary of Results
• OFE is guaranteed to find the parameters optimal wrt likelihood, for the structure G
  – if G is incorrect… optimal-for-G is bad wrt the true distribution: wrong answers to queries!
• ELR is not as constrained by G… it can do well even when the structure is incorrect!
• ELR is useful, as the structure is often incorrect: to avoid overfitting, learners use a constrained set of structures (NB, TAN, …)
• Complete data: TAN+ELR > TAN+OFE; NB+ELR > NB+OFE
• Incomplete data: NB+ELR > NB+EM, NB+APN
• ELR was relatively slow
  – 0.5 sec/iteration for small data, … minutes for large data
  – much slower than OFE
  – … but ELR used unoptimized Java code
• APN/EM: … the same algorithm for complete and incomplete data
(See also: Discriminative vs Generative learning…)

TradeOff
• Most BN-learners spend LOTS of time learning the structure, and little time learning the parameters
• Why not… use a SIMPLE (quick-to-learn) structure, and focus the computational effort on getting good parameters?
  – ELR-OFE: initialize the parameters using the OFE values, then run ELR
• Clearly a good idea… should be used for classification tasks!

Related Work
• Lots of work on learning BNs… most of it generative learning
• Some discriminative learners, but most… learn the STRUCTURE discriminatively, then the parameters generatively!
• See also Logistic Learning [GGS'97]: learns the parameters discriminatively, but… different queries, an L2-norm objective (not LCL), and it needed 2 types of data samples, …

Future Work
• Analysis: NP-hard to learn LCL-optimal parameters for an arbitrary structure with incomplete data…
  what is the complexity given complete data? … a simple structure? … a nearly correct structure?
• Learn the STRUCTURE discriminatively as well
• Understand further why ELR works so well, vs OFE (complete data) and vs EM/APN (incomplete data), for a fixed simple structure (NB, TAN)

Contributions
• Motivate/describe discriminative learning of BN parameters
• Complexity of the task (NP-hard; polynomial sample size)
• An algorithm for the task, ELR
  – complete or incomplete data
  – arbitrary structures
  – softmax version, optimizations, …
• Empirical results showing ELR works, + a study to show why…

This work was partially funded by NSERC and by Syncrude.