AAAI Presentation on "ELR" -- 2002

Structure Extension to Logistic Regression:
Discriminative Parameter Learning of Belief Net
Classifiers
Russell Greiner* and Wei Zhou
University of Alberta
University of Waterloo
*greiner@cs.ualberta.ca
Belief Net
• B = ⟨ V, A, Θ ⟩
  o Nodes V (Variables)
  o Arcs A (Dependencies)
  o Parameters Θ (Conditional probabilities)
• B ≡ Distribution PB( c, e )
• B ≡ Classifier hB( e ) = arg maxc PB( c | e )
  [Figure: small belief net over nodes X, W, Y, Z, Q, B]

Learner's task…
• Ideally, minimize…
  o KL( truth, B ) = Σc,e truth( c, e ) ln [ truth( c, e ) / PB( c, e ) ]
  o err( B ) = Σc,e truth( c, e ) · I( c ≠ hB( e ) )
• … but the learner only sees a labeled data sample S over attributes E1 E2 E3 … En and class C
  [Table: labeled training instances ⟨ e, c ⟩]

If goal is …
• Generative (learn distribution):
  B(ML) = arg maxB 1/|S| Σi ln PB( ci , ei )
• Discriminative (learn classifier):
  B* = arg minB err( B ) = arg minB Σi I( ci ≠ hB( ei ) )
  B(MCL) = arg maxB 1/|S| Σi ln PB( ci | ei )
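To make the two objectives concrete, here is a minimal Python sketch (an illustration, not the talk's code) that scores a tiny naïve Bayes net with made-up CPtable entries under both criteria: the generative objective averages ln PB(c, e) over the sample, the discriminative one averages ln PB(c | e).

```python
# Sketch: generative (ML) vs discriminative (MCL) objectives on a tiny naive Bayes
# net C -> E1, C -> E2.  All parameter values below are made up for illustration.
import math

theta_c = 0.6                                   # P(C=1)
theta_e = {(1, 1): 0.8, (1, 0): 0.3,            # P(Ei=1 | C=c), keyed by (i, c)
           (2, 1): 0.7, (2, 0): 0.4}

def p_joint(c, e):
    """P_B(c, e) for the naive Bayes net above (e = (e1, e2))."""
    p = theta_c if c == 1 else 1.0 - theta_c
    for i, ei in enumerate(e, start=1):
        q = theta_e[(i, c)]
        p *= q if ei == 1 else 1.0 - q
    return p

def p_cond(c, e):
    """P_B(c | e) = P_B(c, e) / sum over c' of P_B(c', e)."""
    return p_joint(c, e) / (p_joint(0, e) + p_joint(1, e))

S = [((1, 0), 1), ((0, 0), 0), ((1, 1), 1), ((0, 1), 0)]        # (e, c) pairs

ml_objective  = sum(math.log(p_joint(c, e)) for e, c in S) / len(S)   # maximized by B(ML)
mcl_objective = sum(math.log(p_cond(c, e))  for e, c in S) / len(S)   # maximized by B(MCL)
print(f"1/|S| sum ln P(c,e) = {ml_objective:.3f}    1/|S| sum ln P(c|e) = {mcl_objective:.3f}")
```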
Our specific task:
• Given
  o Structure G = ⟨ V, A ⟩ (nodes, arcs… not parameters)
  o Labeled data sample S
• Goal: find the parameters Θ that maximize
  LCL(S)( Θ ) = 1/|S| Σ⟨c,e⟩∈S log PΘ( c | e )
Computational Complexity:
• NP-hard to find the values Θ that optimize LCL(S)( Θ )
  … even if only over All G,γ (parameter sets with every θd|f ≥ γ), for γ = O(1/N)
• Proof: [figure: small net over X, W, Y, Z, Q, with CPtable entries θ+z|X=x,W=w , θ−z|X=x,W=w]
c ,e S
2
 N2K K 3 1 
 2
6K
ln δ  K ln γ ε   O ε 2 ln ε δ ln γ 




 Then, with probability at least 1-,
*
ˆ
LCL( 
)
within

of
LCL(
G,  ).
G , 
 log P (c | e)
c ,e S

Notes:
• Similar bounds when dealing with err(), as with LCL()
• [Dasgupta,1997] proves [explicit polynomial sample-size bound]
complete tuples sufficient wrt Likelihood.
• Same O(·) as our bound, ignoring ln²(·) and ln³(·) terms
• The γ is unavoidable here… (unlike the likelihood case [ATW91])
ELR Learning Algorithm:
• Input:
  o Structure G
  o Labeled data sample S
• Output: parameters Θ̂(S) that maximize LCL(S)( Θ ), i.e.
  Θ̂(S) = arg maxΘ { LCL(S)( Θ ) } = arg maxΘ Σ⟨c,e⟩∈S log PΘ( c | e )
• As NP-hard… Hillclimb !
• How??

How to HillClimb?
• Change each θd|f to improve
  LCL(S)( Θ ) = 1/|S| Σ⟨c,e⟩∈S log PΘ( c | e )
  (each θd|f is one CPtable entry, e.g. θd|f = P( D=d | F1=f1, F2=f2 ))
• Not just changing { θd|f }, as constraints:
  a. θd|f ≥ 0
  b. Σd θd|f = 1
• So… use "softmax" terms
  θd|f = exp( βd|f ) / Σd' exp( βd'|f )
  … climb along the βd|f 's !
• Need derivative:
  ∂ LCL(S)( Θ ) / ∂ βd|f = Σ⟨c,e⟩∈S [ PΘ( d, f | e, c ) − PΘ( d, f | e ) − θd|f ( PΘ( f | e, c ) − PΘ( f | e ) ) ]
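A minimal sketch of what one such climbing step could look like, assuming a naïve Bayes structure, binary variables, and complete data (with complete data the probabilities in the derivative reduce to indicators and the class posterior, so no general inference engine is needed). This is an illustration of the update, not the authors' implementation.

```python
# Sketch of one ELR hillclimbing step for a (hypothetical) naive Bayes structure
# C -> E1..En with binary variables and COMPLETE data.  CPtable entries are held as
# softmax parameters beta; each beta is nudged along the LCL derivative shown above.
import math
import numpy as np

n_attr = 3
beta = {('C',): np.zeros(2)}                               # softmax params for P(C=c)
beta.update({('E', i, c): np.zeros(2)                      # softmax params for P(Ei=d | C=c)
             for i in range(n_attr) for c in (0, 1)})

def theta(key):
    b = beta[key]
    return np.exp(b) / np.exp(b).sum()                     # softmax -> valid CPtable row

def p_joint(c, e):
    p = theta(('C',))[c]
    for i, ei in enumerate(e):
        p *= theta(('E', i, c))[ei]
    return p

def p_class(e):                                            # P(C=. | e) by enumerating C
    p = np.array([p_joint(0, e), p_joint(1, e)])
    return p / p.sum()

def lcl(sample):
    return sum(math.log(p_class(e)[c]) for e, c in sample) / len(sample)

def elr_step(sample, lr=0.5):
    grad = {k: np.zeros(2) for k in beta}
    for e, c in sample:
        post = p_class(e)
        # class node: the parent assignment f is empty, so P(f|e,c) = P(f|e) = 1
        for d in (0, 1):
            grad[('C',)][d] += (1.0 if c == d else 0.0) - post[d]
        # attribute nodes: d is a value of Ei, f is the parent assignment C=cv
        for i, ei in enumerate(e):
            for cv in (0, 1):
                p_f_ec = 1.0 if c == cv else 0.0           # P(f | e, c)
                p_f_e = post[cv]                           # P(f | e)
                for d in (0, 1):
                    ind = 1.0 if ei == d else 0.0          # Ei observed, so its prob is 0/1
                    grad[('E', i, cv)][d] += (ind * p_f_ec - ind * p_f_e
                                              - theta(('E', i, cv))[d] * (p_f_ec - p_f_e))
    for k in beta:
        beta[k] += lr * grad[k] / len(sample)              # one gradient-ascent step on LCL

sample = [((1, 0, 1), 1), ((0, 0, 1), 0), ((1, 1, 1), 1), ((0, 1, 0), 0)]
print("LCL before:", round(lcl(sample), 4))
elr_step(sample)
print("LCL after :", round(lcl(sample), 4))
```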
Other Algorithms…
• When given complete data…
• Compare to OFE (Observed Frequency Estimate)
  o Trivial algorithm… maximizes Likelihood
E1 E2 … Ek | C
 1  1 …  0 | 1
 0  1 …  1 | 1
 1  0 …  1 | 1
 0  0 …  0 | 0
 0  1 …  1 | 0

• OFE just uses the observed frequencies, e.g.:
  o 3 "C=1"s among the 5 records, so θC=1 = 3/5
  o 2 "E1=1, C=1"s among those 3 records, so θE1=1|C=1 = 2/3
  o θE1=1|C=0 = … , and so on for every CPtable entry
  [Figure: Naïve Bayes net C → E1, E2, …, Ek labeled with these estimates]
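A small sketch of those OFE counts, assuming the naïve Bayes structure; the rows are the table above.

```python
# Sketch: OFE (observed frequency estimates) on the 5-record table above,
# assuming a naive Bayes structure C -> E1, E2, ..., Ek.
rows = [  # (E1, E2, Ek, C)
    (1, 1, 0, 1),
    (0, 1, 1, 1),
    (1, 0, 1, 1),
    (0, 0, 0, 0),
    (0, 1, 1, 0),
]

n_c1 = sum(1 for r in rows if r[3] == 1)
theta_c1 = n_c1 / len(rows)                                                  # 3/5
theta_e1_given_c1 = sum(1 for r in rows if r[0] == 1 and r[3] == 1) / n_c1   # 2/3
print(theta_c1, theta_e1_given_c1)
```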
• When given incomplete data…
  o EM (Expectation Maximization)
  o APN [BKRK97] – hillclimb in (unconditional) Likelihood

Relation to Logistic Regression:
• ELR on Naïve Bayes structure ≡ standard Logistic Regression
• ELR deals with arbitrary structures, incomplete data
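A brief sketch of why that equivalence holds for binary variables: under a naïve Bayes structure, PB(C=1 | e) is a sigmoid of a linear function of the attributes, so maximizing LCL over the (softmax) CPtable entries is the same optimization logistic regression performs. The parameter values below are made up.

```python
# Sketch: a binary naive Bayes classifier's conditional P(C=1|e) equals a
# logistic-regression model whose weights are the log-odds of its CPtable entries.
import math

theta_c1 = 0.6
theta = {(i, c): p for i, (p0, p1) in enumerate([(0.3, 0.8), (0.4, 0.7)])
         for c, p in ((0, p0), (1, p1))}        # P(Ei=1 | C=c)

def p_c1_naive_bayes(e):
    num = theta_c1 * math.prod(theta[(i, 1)] if ei else 1 - theta[(i, 1)]
                               for i, ei in enumerate(e))
    den = num + (1 - theta_c1) * math.prod(theta[(i, 0)] if ei else 1 - theta[(i, 0)]
                                           for i, ei in enumerate(e))
    return num / den

def p_c1_logistic(e):
    # bias and weights are the log-odds implied by the same CPtable entries
    b = math.log(theta_c1 / (1 - theta_c1)) + sum(
        math.log((1 - theta[(i, 1)]) / (1 - theta[(i, 0)])) for i in range(2))
    w = [math.log(theta[(i, 1)] / (1 - theta[(i, 1)]))
         - math.log(theta[(i, 0)] / (1 - theta[(i, 0)])) for i in range(2)]
    z = b + sum(wi * ei for wi, ei in zip(w, e))
    return 1 / (1 + math.exp(-z))

for e in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert abs(p_c1_naive_bayes(e) - p_c1_logistic(e)) < 1e-12
```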
• Hillclimbing uses the derivative
  ∂ LCL(S)( Θ ) / ∂ βd|f = Σ⟨c,e⟩∈S [ PΘ( d, f | e, c ) − PΘ( d, f | e ) − θd|f ( PΘ( f | e, c ) − PΘ( f | e ) ) ]
• Optimizations:
  o Initialize using OFE values (not random) – "plug-in parameters"
  o Line-search, conjugate gradient
    ([Minka,2001] confirms these effective for Logistic Regression)
  o Deriv = 0 when D and F are d-separated from E and C… and so can be ignored!
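A hedged sketch of how these optimizations might be wired together with an off-the-shelf conjugate-gradient routine; neg_lcl, neg_lcl_grad and theta_ofe are hypothetical helpers (for example, built like the gradient sketch earlier), not the authors' code.

```python
# Sketch: conjugate-gradient climb on LCL, started from the OFE parameters.
# neg_lcl / neg_lcl_grad are assumed helpers returning -LCL and its gradient
# with respect to a flat vector of softmax parameters beta.
import numpy as np
from scipy.optimize import minimize

def elr_fit(neg_lcl, neg_lcl_grad, theta_ofe):
    # softmax(ln theta) reproduces theta when each CPtable row sums to 1,
    # so beta0 = ln(theta_OFE) makes the climb start exactly at the OFE estimates.
    beta0 = np.log(np.clip(theta_ofe, 1e-6, None))
    result = minimize(neg_lcl, beta0, jac=neg_lcl_grad, method="CG")
    return result.x                       # optimized softmax parameters
```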
Empirical Results
[Figures: NaïveBayes Structure; TAN Structure]
• NaïveBayes Structure
– Attributes independent, given Class
• 25 Datasets
– 23 from UCI, continuous + discrete
– 2 from SelectiveNB study
– (used by [FGG’96])
[Figure: TAN structure; Chess domain]
All 25 Domains
• Below y=x ⇒ NB+ELR better than NB+OFE
• Bars are 1 standard deviation
• ELR better than OFE! (p < 0.005)
• OFE works only with COMPLETE data
• Given INCOMPLETE data:
  o EM (Expectation Maximization)
  o APN (Adaptive Probabilistic Networks)
• Experiments using
  o NaïveBayes, TAN
• TAN structure:
  o Link from Class node to each attribute
  o Tree-structure connecting attributes
  o Permits dependencies between attributes
  o Efficient Learning alg; Classification alg
  o Works well in practice… [FGG'97]
Other Studies
• Given data:
  1. Use PowerConstructor [CG02,CG99] to build structure
  2. Use OFE vs ELR to find parameters
• For Chess:
  Insert fig 2b from paper!

Correct structure, incomplete data
• Consider Alarm [BSCC89] structure (+ param):
  o 36 nodes, 47 links, 505 params
• Multiple queries
  o 8 vars as pool of query vars
  o 16 other vars as pool of evidence vars
  o Each query: 1 q.var; each evid var w/ prob ½
    … so expect 16/2 = 8 evidence vars per query
  o NOTE: different q.var for different queries!
    (Like multi-task learning)
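A tiny sketch of that query-generation scheme (the variable names are placeholders, not the actual Alarm node names):

```python
# Sketch of the multiple-query setup described above: one query variable drawn
# from an 8-variable pool, each of 16 evidence variables included with prob 1/2.
import random

query_pool    = [f"Q{i}" for i in range(8)]
evidence_pool = [f"E{i}" for i in range(16)]

def sample_query(rng=random):
    q = rng.choice(query_pool)
    evidence = [v for v in evidence_pool if rng.random() < 0.5]
    return q, evidence          # expect ~8 evidence variables per query

q, ev = sample_query()
print(q, len(ev), ev)
```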
 Results:
Insert fig 6c from paper!
• TAN+ELR > TAN+OFE (p < 0.025)
• TAN+ELR ≈ NB+ELR
Summary of Results
• Complete data:
  TAN + ELR > TAN + OFE
  NB + ELR > NB + OFE
• Incomplete data:
  NB + ELR > NB + EM, NB + APN
• OFE guaranteed to find parameters
  – optimal wrt Likelihood
  – for structure G
• If G incorrect…
  – optimal-for-G is bad wrt true distribution
  ⇒ wrong answers to queries
• … ELR not as constrained by G…
  ⇒ can do well, even when structure incorrect!
• ELR useful, as structure often incorrect
  – constrained set of structures (NB, TAN, …) used to avoid overfitting
• See Discriminative vs Generative learning…
• ELR was relatively slow
  – ≈ 0.5 sec/iteration for small, … minutes for large data
  – much slower than OFE, ≈ APN/EM
  – … same alg for Complete/INcomplete data
  – … ELR used unoptimized JAVA code
Related Work
• Lots of work on learning BNs
  … most Generative learning
• Some discriminative learners, but most…
  o learn STRUCTURE discriminatively
  o then parameters generatively !
• See also Logistic Learning
• [GGS'97] learns params discriminatively but…
  o different queries, L2-norm (not LCL)
  o needed 2 types of data-samples, …
Correctness of Structure
• Model #0 (NaïveBayes over C, E1 … Ek): P(C) = 0.9, P(Ei|C) = 0.2, P(Ei|~C) = 0.8
• … then P(Ei|E1) = 1.0, P(Ei|~E1) = 0.0 when attributes are "joined"
  for model#2, model#3, …
• Measured Classification Error
  – k = 5, 400 records, …
• Compare NB+ELR to NB+OFE wrt
  – increasingly "non-NB data"
• "Generative" Learner (OFE/APN/EM)
  – very constrained by structure…
  – So if structure is wrong, cannot do well!
• "Discriminative" Learner (ELR)
  – not as constrained!

CORRAL
• artificial dataset, fn of 4 attributes
• NB does poorly on CORRAL
  – TAN can deal with dependent attributes, NB cannot
  – … but ELR is designed to help classify; OFE is not
• TAN+ELR did perfectly on CORRAL!
• Gen'l: NB+ELR ≈ TAN+OFE

ELR-OFE:
• Initialize params using OFE values
• Then run ELR

Future work:
• Analysis
  – Complete Data
  – Nearly correct structure
• Why does ELR work so well
  – vs OFE (complete data)
  – vs EM / APN (incomplete data)
  for fixed simple structure (NB, TAN)?
Missing Data
• So far, each dataset complete
  – includes value of every attribute in each instance
  – Complete Data: every attribute of every instance specified
• Now… some omissions
  – Omit values of attributes w/ prob = 0.25
  – "Missing Completely at Random" [BKRK97]
• 25% MCAR omissions:
  – NB+ELR better than NB+EM, NB+APN (p < 0.025)
  – TAN+ELR ≈ TAN+EM ≈ TAN+APN
  – TAN algorithm problematic… with incomplete data
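A minimal sketch of the MCAR omission process described above, with None marking a missing value and the 0.25 rate from the slides:

```python
# Sketch: MCAR omissions, each attribute value independently dropped with prob 0.25.
import random

def mcar_omit(rows, p_missing=0.25, rng=random):
    """rows: list of (attribute-tuple, class) pairs; returns copies with omissions."""
    out = []
    for e, c in rows:
        e_obs = tuple(v if rng.random() >= p_missing else None for v in e)
        out.append((e_obs, c))
    return out

sample = [((1, 0, 1), 1), ((0, 1, 1), 0)]
print(mcar_omit(sample))
```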
Contributions:
• Motivate/Describe
– discriminative learning for BN-parameters
• Complexity of task (NP-hard, poly sample size)
• Algorithm for task, ELR
– complete or incomplete data
– arbitrary structures
– soft-max version, optimizations, …
• Empirical results showing ELR works
+ study to show why…
TradeOff
• Most BN-learners
  – Spend LOTS of time learning structure
  – Little time learning parameters
• Use SIMPLE (quick-to-learn) structure
  – Focus computational effort on getting good parameters
• Clearly a good idea…
  – should be used for Classification Tasks!
• Why not… arbitrary structure, incomplete data?
  – NP-hard to learn LCL-optimal parameters …
  – Now: assume fixed structure
• What is complexity if complete data? … simple structure?
• Learn STRUCTURE as well… discriminatively
This work was partially funded
by NSERC and by Syncrude.