Part 2: Rule-based approaches

Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains

Jinyan Li
Limsoon Wong

Copyright © 2004 by Jinyan Li and Limsoon Wong
Outline
• Overview of Supervised Learning
• Decision Tree Ensembles
  – Bagging
  – Boosting
  – Random forest
  – Randomization trees
  – CS4
Overview of Supervised Learning
Computational Supervised Learning
• Also called classification
• Learn from past experience, and use the learned knowledge to classify new data
• Knowledge is learned by intelligent algorithms
• Examples:
  – Clinical diagnosis for patients
  – Cell type classification
Data
• A classification application involves more than one class of data, e.g.,
  – normal vs disease cells in a diagnosis problem
• Training data is a set of instances (samples, points) with known class labels
• Test data is a set of instances whose class labels are to be predicted
Notation
• Training data
  {(x1, y1), (x2, y2), …, (xm, ym)}
  where the xj are n-dimensional vectors
  and the yj are from a discrete space Y.
  E.g., Y = {normal, disease}.
• Test data
  {(u1, ?), (u2, ?), …, (uk, ?)}
Process
• Training data X with class labels Y is used to learn f(X): a classifier, a mapping, a hypothesis
• The classifier f is then applied to test data U to produce f(U): the predicted class labels
Relational Representation of Gene Expression Data

m samples by n features (n on the order of 1000):

          gene1  gene2  gene3  gene4  …  genen  class
sample 1  x11    x12    x13    x14    …  x1n    P
sample 2  x21    x22    x23    x24    …  x2n    N
sample 3  x31    x32    x33    x34    …  x3n    P
…         …      …      …      …      …  …      …
sample m  xm1    xm2    xm3    xm4    …  xmn    N
Features
• Also called attributes
• Categorical features
  – e.g., feature color = {red, blue, green}
• Continuous or numerical features
  – gene expression
  – age
  – blood pressure
• Discretization: converting a numerical feature into a categorical one (a small sketch follows)
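A minimal discretization sketch in Python (added here for illustration; the cut-points and labels are placeholders, not values from the slides):

def discretize(value, cutpoints, labels):
    """Map a numerical value to a categorical label.

    E.g., cutpoints [75] with labels ["low", "high"] turn
    humidity 70 into "low" and humidity 90 into "high".
    """
    for cut, label in zip(cutpoints, labels):
        if value <= cut:
            return label
    return labels[-1]

print(discretize(70, [75], ["low", "high"]))  # low
print(discretize(90, [75], ["low", "high"]))  # high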
An Example
[figure: an example dataset]
Overall Picture of Supervised Learning

Classifiers ("M-Doctors"):
• Decision trees
• Emerging patterns
• SVM
• Neural networks

Application domains:
• Biomedical
• Financial
• Government
• Scientific
Evaluation of a Classifier
• Performance on independent blind test data
• K-fold cross validation: divide the dataset into k even parts; k-1 of them are used for training, and the remaining part is treated as test data; rotate until every part has served as test data once (a sketch follows this list)
• LOOCV (leave-one-out cross validation), a special case of k-fold CV with k equal to the number of instances
• Accuracy, error rate
• False positive rate, false negative rate, sensitivity, specificity, precision
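As a minimal illustration (not from the original slides), the k-fold procedure can be sketched in Python; train_fn is a placeholder base inducer that returns a classifier function:

import random

def k_fold_accuracy(instances, labels, train_fn, k=10, seed=0):
    """Estimate accuracy by k-fold cross validation.

    train_fn(train_X, train_y) is a placeholder base inducer
    returning a classifier: a function from instance to label.
    """
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]       # k roughly even parts

    correct = 0
    for fold in folds:                          # each part is test data once
        held_out = set(fold)
        train_X = [instances[i] for i in idx if i not in held_out]
        train_y = [labels[i] for i in idx if i not in held_out]
        classifier = train_fn(train_X, train_y)
        correct += sum(classifier(instances[i]) == labels[i] for i in fold)
    return correct / len(instances)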
Requirements of Biomedical Classification
• High accuracy
• High comprehensibility
Importance of Rule-Based Methods
• Systematic selection of a small number of features used for decision making increases the comprehensibility of the knowledge patterns
• C4.5 and CART are two commonly used rule induction algorithms, also called decision tree induction algorithms
Structure of Decision Trees

[figure: root node tests x1 > a1; internal nodes test x2 > a2, x3, x4; leaf nodes carry the class labels A and B]

• Each root-to-leaf path is a rule; e.g., if x1 > a1 & x2 > a2, then it's class A
• C4.5 and CART are two of the most widely used
• Easy interpretation, but accuracy is generally unattractive
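To see why such trees are easy to interpret, the highlighted rule can be written directly as code; a hypothetical sketch with placeholder thresholds:

def classify(x1, x2, a1=0.5, a2=0.3):
    """A hand-written stand-in for a learned tree fragment.

    Each root-to-leaf path is one rule; the thresholds a1 and a2
    are placeholders, not values from the slides.
    """
    if x1 > a1:                # root node
        if x2 > a2:            # internal node
            return "A"         # rule: x1 > a1 & x2 > a2 -> class A
        return "B"
    return "B"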
Elegance of Decision Trees
[figure: a decision tree and the class regions (A, B) it induces]
Brief History of Decision Trees
• CLS (Hunt et al., 1966) --- cost driven
• CART (Breiman et al., 1984) --- Gini index
• ID3 (Quinlan, 1986, MLJ) --- information driven
• C4.5 (Quinlan, 1993) --- gain ratio + pruning ideas
A Simple Dataset
• 9 Play samples
• 5 Don't Play samples
• A total of 14 (the full table appears below)
A Decision Tree

outlook = sunny:
  humidity <= 75: Play (2)
  humidity > 75: Don't (3)
outlook = overcast: Play (4)
outlook = rain:
  windy = false: Play (3)
  windy = true: Don't (2)

• Finding an optimal (smallest) decision tree consistent with the training data is an NP-complete problem
Construction of a Decision Tree
• The key step is the determination of the root node of the tree and, recursively, the root nodes of its sub-trees
Most Discriminatory Feature
• Every feature can be used to partition the training data
• If the partitions contain a pure class of training instances, then this feature is the most discriminatory one
Example of Partitions
• Categorical feature
  – the number of partitions of the training data is equal to the number of values of this feature
• Numerical feature
  – two partitions, split at a threshold (e.g., Temp <= 70 vs Temp > 70)
Instance #  Outlook   Temp  Humidity  Windy  class
1           Sunny     75    70        true   Play
2           Sunny     80    90        true   Don't
3           Sunny     85    85        false  Don't
4           Sunny     72    95        true   Don't
5           Sunny     69    70        false  Play
6           Overcast  72    90        true   Play
7           Overcast  83    78        false  Play
8           Overcast  64    65        true   Play
9           Overcast  81    75        false  Play
10          Rain      71    80        true   Don't
11          Rain      65    70        true   Don't
12          Rain      75    80        false  Play
13          Rain      68    80        false  Play
14          Rain      70    96        false  Play
Partition by Outlook (total 14 training instances):

Outlook = sunny:     instances 1,2,3,4,5      → P,D,D,D,P
Outlook = overcast:  instances 6,7,8,9        → P,P,P,P
Outlook = rain:      instances 10,11,12,13,14 → D,D,P,P,P
Partition by Temperature (total 14 training instances):

Temperature <= 70:  instances 5,8,11,13,14         → P,P,D,P,P
Temperature > 70:   instances 1,2,3,4,6,7,9,10,12  → P,D,D,D,P,P,P,D,P
Three Measures
• Gini index
• Information gain
• Gain ratio
(worked out below on the Outlook partition)
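A worked sketch (added here, not on the original slides) computing the three measures on the Outlook partition; the class counts come straight from the partition slides above:

from math import log2

def entropy(counts):
    """Entropy of a class distribution, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index of a class distribution."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def info_gain(parent, partitions):
    """Information gain of splitting `parent` into `partitions`."""
    total = sum(parent)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - remainder

def gain_ratio(parent, partitions):
    """Gain ratio = information gain / split information."""
    split_info = entropy([sum(p) for p in partitions])
    return info_gain(parent, partitions) / split_info

# Outlook split of the 14-instance dataset:
# sunny -> 2 Play, 3 Don't; overcast -> 4 Play; rain -> 3 Play, 2 Don't
parent = [9, 5]
outlook = [[2, 3], [4, 0], [3, 2]]
print(info_gain(parent, outlook))   # ~0.247 bits
print(gain_ratio(parent, outlook))  # ~0.156
print(gini(parent))                 # ~0.459 for the unsplit data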
Steps of Decision Tree Construction
• Select the best feature as the root node of the whole tree
• After partitioning by this feature, select the best feature (wrt each subset of the training data) as the root node of that sub-tree
• Recurse until the partitions become pure or almost pure (a minimal recursive sketch follows)
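A minimal recursive sketch of these steps, assuming categorical features and a pluggable measure such as information gain; this is an illustration, not C4.5 itself (no pruning, no numeric splits):

def build_tree(instances, labels, features, measure):
    """Greedy tree induction sketch.

    instances are dicts mapping feature -> value;
    measure(labels, partitions) scores a candidate split
    (e.g., information gain). Stops on pure partitions.
    """
    if len(set(labels)) == 1:             # pure partition: make a leaf
        return labels[0]
    if not features:                      # no tests left: majority leaf
        return max(set(labels), key=labels.count)

    def split(f):
        parts = {}
        for x, y in zip(instances, labels):
            parts.setdefault(x[f], []).append((x, y))
        return parts

    # Greedily pick the most discriminatory feature as root
    best = max(features, key=lambda f: measure(labels, split(f)))
    tree = {"feature": best, "children": {}}
    for value, part in split(best).items():
        xs = [x for x, _ in part]
        ys = [y for _, y in part]
        tree["children"][value] = build_tree(
            xs, ys, [f for f in features if f != best], measure)
    return tree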
Characteristics of C4.5 Trees
• Single coverage of training data (elegance)
• Divide-and-conquer splitting strategy
• Fragmentation problem
• Locally reliable but globally insignificant rules
• Missing many globally significant rules can mislead the system
Decision Tree Ensembles
• Bagging
• Boosting
• Random forest
• Randomization trees
• CS4
Motivating Example
• h1, h2, h3 are independent classifiers, each with accuracy 60%
• C1 and C2 are the only classes
• t is a test instance in C1
• h(t) = argmax_{C ∈ {C1, C2}} |{hj ∈ {h1, h2, h3} : hj(t) = C}|  (majority vote)
• Then prob(h(t) = C1)
  = prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C1)
  + prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C2)
  + prob(h1(t)=C1 & h2(t)=C2 & h3(t)=C1)
  + prob(h1(t)=C2 & h2(t)=C1 & h3(t)=C1)
  = 60% * 60% * 60% + 60% * 60% * 40% + 60% * 40% * 60% + 40% * 60% * 60%
  = 64.8%
• So the majority vote of three weak but independent classifiers is more accurate than any single one of them
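The same calculation, generalized to any odd committee size, assuming independent classifiers of equal accuracy:

from math import comb

def majority_vote_accuracy(k, p):
    """Probability that a majority of k independent classifiers,
    each correct with probability p, votes for the true class
    (k odd, two classes)."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range((k // 2) + 1, k + 1))

print(majority_vote_accuracy(3, 0.6))   # 0.648, as on the slide
print(majority_vote_accuracy(11, 0.6))  # larger committee, higher accuracy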
Bagging
• Proposed by Breiman (1996)
• Also called bootstrap aggregating
• Makes use of randomness injected into the training data
Main Ideas

Original training set: 48 p + 52 n
  → draw bootstrap samples (sampling with replacement): 50 p + 50 n, 49 p + 51 n, …, 53 p + 47 n
  → apply a base inducer such as C4.5 to each sample
  → obtain a committee H of classifiers: h1, h2, …, hk
Decision Making by Bagging
Given a new test sample T, each classifier in H predicts a class for T; the committee outputs the class receiving the most (equally weighted) votes.
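A minimal bagging sketch under the same placeholder train_fn interface as the earlier cross-validation sketch:

import random
from collections import Counter

def bagging(instances, labels, train_fn, k=25, seed=0):
    """Train a committee of k classifiers on bootstrap samples."""
    rng = random.Random(seed)
    committee = []
    for _ in range(k):
        # Bootstrap: draw m instances with replacement
        idx = [rng.randrange(len(instances)) for _ in instances]
        committee.append(train_fn([instances[i] for i in idx],
                                  [labels[i] for i in idx]))
    return committee

def bagging_predict(committee, t):
    """Equal voting: the most common prediction wins."""
    votes = Counter(h(t) for h in committee)
    return votes.most_common(1)[0][0]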
Boosting
• AdaBoost by Freund & Schapire (1995)
• Also called adaptive boosting
• Makes use of weighted instances and weighted voting
Main Ideas

100 instances with equal weight
  → train a classifier h1 and measure its weighted error e1
  → if the error is 0 or > 0.5, stop
  → otherwise multiply the weights of the correctly classified instances by e1/(1 - e1), and renormalize the weights to sum to 1
  → 100 instances with different weights
  → train a classifier h2, measure its error e2, and repeat
Decision Making by AdaBoost.M1
Given a new test sample T, each classifier hi votes for a class with weight log((1 - ei)/ei); the class with the largest total weight wins.
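A compact sketch of AdaBoost.M1 following the recipe above; train_fn is assumed (hypothetically) to accept per-instance weights:

import math

def adaboost_m1(instances, labels, train_fn, k=10):
    """AdaBoost.M1 sketch: sequentially reweighted training.

    train_fn(instances, labels, weights) -> classifier is an
    assumed interface. Returns (classifiers, voting weights).
    """
    m = len(instances)
    w = [1.0 / m] * m
    committee, alphas = [], []
    for _ in range(k):
        h = train_fn(instances, labels, w)
        wrong = [h(x) != y for x, y in zip(instances, labels)]
        e = sum(wi for wi, bad in zip(w, wrong) if bad)
        if e == 0 or e > 0.5:            # stop as on the slide
            break
        beta = e / (1 - e)
        # Down-weight correctly classified instances, then renormalize
        w = [wi * (1.0 if bad else beta) for wi, bad in zip(w, wrong)]
        total = sum(w)
        w = [wi / total for wi in w]
        committee.append(h)
        alphas.append(math.log(1 / beta))   # voting weight log((1-e)/e)
    return committee, alphas

def adaboost_predict(committee, alphas, t, classes):
    """Weighted voting over the committee."""
    score = {c: 0.0 for c in classes}
    for h, a in zip(committee, alphas):
        score[h(t)] += a
    return max(score, key=score.get)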
Bagging vs Boosting
• Bagging
  – Construction of the bagging classifiers is independent
  – Equal voting
• Boosting
  – Construction of a new boosting classifier depends on the performance of its previous classifier, i.e., sequential construction (a series of classifiers)
  – Weighted voting
Random Forest
• Proposed by Breiman (2001)
• Similar to bagging, but the base inducer is not the standard C4.5
• Makes use of randomness twice: bootstrap sampling of the training data, and random feature selection at each node
Main Ideas

Original training set: 48 p + 52 n
  → draw bootstrap samples: 50 p + 50 n, 49 p + 51 n, …, 53 p + 47 n
  → apply the base inducer (not C4.5, but a revised version) to each sample
  → obtain a committee H of classifiers: h1, h2, …, hk
A Revised C4.5 as Base Classifier
• At the root node (and every node below it), the split is not selected from all n original features; instead, it is selected from mtry randomly chosen features
Decision Making by Random Forest
Given a new test sample T, every tree in the forest predicts a class for T; the forest outputs the majority (equally weighted) vote.
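The per-node change can be sketched in a few lines; score is a placeholder split-quality function such as information gain:

import random

def choose_split(features, mtry, score, rng=random):
    """Random-forest node step: restrict the candidate features
    to a random subset of size mtry, then pick the best of those.

    score(f) evaluates feature f on the data reaching this node;
    it is a placeholder, e.g. information gain.
    """
    candidates = rng.sample(features, min(mtry, len(features)))
    return max(candidates, key=score)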
Randomization Trees
• Proposed by Dietterich (2000)
• Makes use of randomness in the selection of the best split point
Main Ideas

At the root node (and each node below it), the split is not the single best one over all n original features; instead, one is selected at random from the best candidate split points, e.g.:

  feature 1: choice 1, 2, 3
  feature 2: choice 1, 2
  …
  feature 8: choice 1, 2, 3
  (a total of 20 candidates)

Equal voting on the committee of such decision trees
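A sketch of that node step, assuming the candidate splits have already been enumerated and score is a placeholder quality function:

import random

def randomized_split(candidate_splits, score, top=20, rng=random):
    """Randomization-tree node step: rank all candidate splits
    (feature, split-point pairs) by score, keep the best `top`,
    and pick one of them uniformly at random."""
    ranked = sorted(candidate_splits, key=score, reverse=True)
    return rng.choice(ranked[:top])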
CS4
• Proposed by Li et al. (2003)
• CS4: Cascading and Sharing for decision trees
• Does not make use of randomness
Main Ideas

The k top-ranked features serve, in turn, as root nodes, giving a total of k trees: root node 1 grows tree-1, root node 2 grows tree-2, …, root node k grows tree-k.

Selection of root nodes is done in a cascading manner!
Decision Making by CS4
Given a new test sample T, voting is not equal: each tree's vote is weighted (e.g., by the quality of its rules that cover T).
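A schematic sketch of the cascading construction under the interpretation above; rank and grow_subtree are hypothetical helpers, not Li et al.'s implementation:

def cs4(instances, labels, features, rank, grow_subtree, k=10):
    """Schematic CS4 sketch: no randomness, cascading roots.

    rank(features, instances, labels) -> features sorted from
    best to worst (e.g., by gain ratio); grow_subtree builds a
    tree constrained to the given root feature. Both helpers
    are hypothetical placeholders.
    """
    ranked = rank(features, instances, labels)
    trees = []
    for i in range(min(k, len(ranked))):
        # Tree i shares the training data but is rooted at the
        # i-th best feature
        trees.append(grow_subtree(root=ranked[i],
                                  instances=instances, labels=labels))
    return trees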
Summary of Ensemble Classifiers

Bagging, Random Forest, AdaBoost.M1, Randomization Trees:
  randomness is involved, so their rules may not be correct when applied to the training data

CS4:
  no randomness; its rules are correct on the training data
Any Questions?