Part 4: Interesting rules and patterns

Copyright © 2004 by Jinyan Li and Limsoon Wong
Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Jinyan Li
Limsoon Wong
Part 4:
Interesting Rules and Patterns
Outline
• Some interesting decision trees
• Performance of CS4
• Demo
Some Interesting
Decision Trees
Decision Tree on a Prostate Data Set
• Singh et al., Cancer Cell 1:203-209, 2002
• 102 instances
– 52 tumor samples
– 50 normal samples
• ~12,500 numeric features
– Each one represents a gene (or probe)
– Its value is the expression level of that gene
C4.5 Tree
32598_at
  <= 29: 33886_at
    <= 10: 34950_at
      <= 5: Tumor
      > 5: Normal
    > 10: Normal
  > 29: 40707_at
    <= -6: Tumor
    > -6: Normal
[Leaf annotations surviving from the figure: 6, 3(+1), 3(+1)]
Rule Translation
• The tree can be translated into 5 rules
• Two of them are significant rules; the remaining three are trivial
• The two significant rules dominate the two classes: one the normal class, the other the tumor class
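The translation from tree to rules can be sketched in Python: every root-to-leaf path becomes one rule. The nested-tuple encoding and the function name `tree_to_rules` are illustrative assumptions, not part of the original slides. The thresholds are read from the C4.5 Tree slide; in particular the 40707_at split is taken as -6, as printed in the figure, whereas the rule text on the following slide prints this condition as "> 6".

```python
# Assumed encoding of the prostate tree from the C4.5 Tree slide.
# Internal node: (feature, threshold, subtree_if_le, subtree_if_gt).
# Leaf: the class label as a string.
TREE = (
    "32598_at", 29,
    ("33886_at", 10,
        ("34950_at", 5, "Tumor", "Normal"),
        "Normal"),
    ("40707_at", -6, "Tumor", "Normal"),  # -6 as printed in the figure
)

def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path."""
    if isinstance(node, str):          # leaf: the path so far is a full rule
        return [(list(conditions), node)]
    feature, threshold, le_branch, gt_branch = node
    return (
        tree_to_rules(le_branch, conditions + (f"{feature} <= {threshold}",))
        + tree_to_rules(gt_branch, conditions + (f"{feature} > {threshold}",))
    )

rules = tree_to_rules(TREE)
for conds, label in rules:
    print("IF " + " AND ".join(conds) + " THEN " + label)
```

The tree has five leaves, so the sketch yields five rules; which of them are significant depends on how many training samples each leaf covers, which the encoding above does not record.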
Significance of the Rules
• Two significant rules
– If x <= 29 and y <= 10 and z <= 5, then this is a tumor cell (94%), where x, y, z represent 32598_at, 33886_at, 34950_at respectively
– If x > 29 and 40707_at > 6, then this is a normal cell (82%)
• Three trivial rules: 12%, 6%, 6%
Another Gene Expression Data Set
• Yeoh et al., Cancer Cell 1:133-143, 2002
• Differentiating the MLL subtype from other subtypes of childhood leukemia
• Training data
– 14 MLL vs 201 others
• Test data
– 6 MLL vs 106 others
• Number of features
– 12558
The Decision Tree
4 mistakes on
test data
Translating the Tree into a
Mathematical Function
Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!
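The point can be illustrated with a small Python sketch of a tree written as a function. The actual tree and formula from the slide are not recoverable here, so the gene names (`gene_a` to `gene_d`), the thresholds, and the shape of the tree are all hypothetical; only the property being shown (four genes in the tree, at most three read per sample) comes from the slide.

```python
def classify(sample):
    """Hypothetical 4-gene tree. Returns (label, genes actually read)."""
    read = []

    def value(gene):
        read.append(gene)              # record every gene the decision touches
        return sample[gene]

    if value("gene_a") <= 1.0:         # hypothetical threshold
        if value("gene_b") <= 2.0:     # hypothetical threshold
            label = "OTHERS"
        else:
            label = "MLL" if value("gene_c") > 3.0 else "OTHERS"
    else:
        label = "MLL" if value("gene_d") > 4.0 else "OTHERS"
    return label, read

# Hypothetical expression values for one test sample.
label, genes_read = classify(
    {"gene_a": 0.5, "gene_b": 2.5, "gene_c": 3.5, "gene_d": 0.0}
)
print(label, genes_read)
```

Because each sample follows a single root-to-leaf path, the number of genes read is bounded by the depth of the tree rather than by the number of genes it contains.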
Performance of CS4
Four Points to Demonstrate
• Whether top-ranked features have similar gain ratios
• Whether cascading trees have similar training performance
• Whether the trees have similar structure
• Whether the expanding tree committees can reduce the test errors gradually
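The last point, test errors as the committee expands, can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the trees are stand-in callables on hypothetical features `f1` to `f3`, the data are toy values, and the committee combines votes by plain unweighted majority, which may differ from how CS4 itself scores its trees (this part does not spell that out).

```python
from collections import Counter

def committee_predict(trees, sample):
    """Unweighted majority vote; a tie goes to the label voted first."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

def errors_by_committee_size(trees, samples, labels):
    """Error counts for committees made of the first 1, 2, ..., k trees."""
    counts = []
    for k in range(1, len(trees) + 1):
        wrong = sum(
            committee_predict(trees[:k], s) != y
            for s, y in zip(samples, labels)
        )
        counts.append(wrong)
    return counts

# Toy stand-ins for trees: single threshold tests on hypothetical features.
trees = [
    lambda s: "pos" if s["f1"] > 0 else "neg",
    lambda s: "pos" if s["f2"] > 0 else "neg",
    lambda s: "pos" if s["f3"] > 0 else "neg",
]
samples = [
    {"f1": 1, "f2": 1, "f3": 1},
    {"f1": -1, "f2": 1, "f3": 1},
    {"f1": -1, "f2": -1, "f3": -1},
    {"f1": 1, "f2": -1, "f3": -1},
]
labels = ["pos", "pos", "neg", "neg"]

error_curve = errors_by_committee_size(trees, samples, labels)
print(error_curve)
```

Plotting such an error curve against committee size is one way to see whether adding trees rooted at further top-ranked features helps on the test data.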
An Example
For differentiation between the subtype Hyperdip>50 and some other subtypes of childhood leukemia
Gain Ratios of Top 20 Features
• Gain ratios are: 0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30; 0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28.
• The difference between the 1st and the 20th is only 0.11. In fact, the partitionings induced by the two features differ in only a few samples
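For reference, the gain ratio of a single threshold split can be computed as information gain divided by split information. The sketch below uses the textbook definition only; C4.5 applies further adjustments when it selects thresholds for numeric features, and those are not reproduced here. The function names and the toy expression values are assumptions; the list `TOP20` copies the figures quoted above.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` vs `value > threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:          # degenerate split carries no information
        return 0.0
    n = len(labels)
    gain = (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info

# Toy expression values (hypothetical) for four samples of two classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["A", "A", "B", "B"]
perfect = gain_ratio(values, labels, threshold=2.0)   # separates the classes
partial = gain_ratio(values, labels, threshold=1.0)   # leaves one side mixed

# Gain ratios quoted on the slide, and the spread between 1st and 20th.
TOP20 = [0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30,
         0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28]
spread = round(TOP20[0] - TOP20[-1], 2)
print(perfect, partial, spread)
```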
Training and Test Performance
Two Observations
• The first tree does not always have the best performance
• Alternative trees rooted by other top-ranked features may have better performance than the first tree
The Power of Committee
Compared to Bagging & Boosting
• Bagging made a similar number of mistakes: 2
• However, Boosting made 13 mistakes
Demo