Copyright © 2004 by Jinyan Li and Limsoon Wong

Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Jinyan Li
Limsoon Wong

Part 4: Interesting Rules and Patterns

Outline
• Some interesting decision trees
• Performance of CS4
• Demo

Some Interesting Decision Trees

Decision Tree on a Prostate Data Set
• Singh et al., Cancer Cell 1:203-209, 2002
• 102 instances
• 52 tumor samples
• 50 normal samples
• ~12,500 numeric features
  – Each one represents a gene (or probe)
  – Its value is the expression level of that gene

C4.5 Tree
[Figure: C4.5 tree. Internal nodes and splits: 32598_at (<= 29 / > 29), 33886_at (<= 10 / > 10), 34950_at (<= 5 / > 5), 40707_at (<= -6 / > -6). Leaves are labeled Tumor or Normal; the three small leaves carry the counts 6, 3(+1), and 3(+1).]

Rule Translation
• The tree can be translated into 5 rules
• Two of them are significant rules; the remaining three are trivial
• The two significant rules dominate the two classes: the normal class and the tumor class
[Figure: the C4.5 tree over 32598_at, 33886_at, 34950_at, and 40707_at, repeated]

Significance of the Rules
• Two significant rules
  – If x <= 29 and y <= 10 and z <= 5, then this is a tumor cell (94%), where x, y, z represent 32598_at, 33886_at, 34950_at respectively
  – If x > 29 and 40707_at > 6, then this is a normal cell (82%)
• Three trivial rules: 12%, 6%, 6%

Another Gene Expression Data Set
• Yeoh et al., Cancer Cell 1:133-143, 2002
• Differentiating the MLL subtype from other subtypes of childhood leukemia
• Training data
  – 14 MLL vs 201 others
• Test data
  – 6 MLL vs 106 others
• Number of features
  – 12558

The Decision Tree
• 4 mistakes on test data

Translating the Tree into a Mathematical Function
• Given a test sample, at most 3 of the 4 genes' expression values are needed to make a decision!

Performance of CS4

Four Points to Demonstrate
• Whether top-ranked features have similar gain ratios
• Whether cascading trees have similar training performance
• Whether the trees have similar structure
• Whether the expanding tree committees can reduce the test errors gradually

An Example
• Differentiating the subtype Hyperdip>50 from some other subtypes of childhood leukemia

Gain Ratios of Top 20 Features
• Gain ratios are: 0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30, 0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28
• The difference between the 1st and the 20th is only 0.11. In fact, the partitionings induced by these two features differ in only a few samples

Training and Test Performance

Two Observations
• The first tree does not always have the best performance
• Alternative trees rooted at other top-ranked features may perform better than the first tree

The Power of Committee

Compared to Bagging & Boosting
• Bagging made a similar number of mistakes: 2
• However, Boosting made 13 mistakes

Demo
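The slide on gain ratios of the top 20 features ranks genes by gain ratio, the split criterion used by C4.5. The following is a minimal Python sketch of how such a ranking could be computed for numeric features, not the authors' code. It assumes a simplified criterion: each feature is scored by the best gain ratio over midpoint thresholds, without the additional corrections that C4.5 itself applies. The function names and the toy table (`g1`, `g2`) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a one-sided split carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    # Information gain: class entropy before the split minus weighted entropy after.
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    # Split information: entropy of the partition sizes themselves.
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info


def best_gain_ratio(values, labels):
    """Best gain ratio over midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((gain_ratio(values, labels, t) for t in cuts), default=0.0)


def rank_features(table, labels):
    """Return (feature name, score) pairs, highest gain ratio first."""
    scores = {name: best_gain_ratio(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data: two made-up features over four samples.
    labels = ["Tumor", "Tumor", "Normal", "Normal"]
    table = {"g1": [1, 2, 3, 4], "g2": [1, 3, 2, 4]}
    for name, score in rank_features(table, labels):
        print(f"{name}: {score:.2f}")
```

When several top-ranked features have nearly equal scores, as in the list on that slide, the choice of root feature is close to arbitrary, which is the motivation the slides give for growing alternative trees rooted at other top-ranked features.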