Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Jinyan Li and Limsoon Wong
Copyright © 2004 by Jinyan Li and Limsoon Wong

Part 2: Rule-Based Approaches

Outline
• Overview of Supervised Learning
• Decision Trees
• Ensembles
  – Bagging
  – Boosting
  – Random forest
  – Randomization trees
  – CS4

Overview of Supervised Learning

Computational Supervised Learning
• Also called classification
• Learn from past experience, and use the learned knowledge to classify new data
• Knowledge is learned by intelligent algorithms
• Examples:
  – Clinical diagnosis for patients
  – Cell type classification

Data
• A classification application involves more than one class of data, e.g.,
  – normal vs disease cells for a diagnosis problem
• Training data is a set of instances (samples, points) with known class labels
• Test data is a set of instances whose class labels are to be predicted

Notation
• Training data: {(x1, y1), (x2, y2), …, (xm, ym)}, where the xj are n-dimensional vectors and the yj are drawn from a discrete space Y, e.g., Y = {normal, disease}
• Test data: {(u1, ?), (u2, ?), …, (uk, ?)}

Process
(diagram) Training data X together with its class labels Y is used to learn f: a classifier, a mapping, a hypothesis; applying f to test data U yields f(U), the predicted class labels

Relational Representation of Gene Expression Data
m samples by n features (n on the order of 1000), plus a class column:

  gene1  gene2  gene3  gene4  …  genen   class
  x11    x12    x13    x14    …  x1n     P
  x21    x22    x23    x24    …  x2n     N
  x31    x32    x33    x34    …  x3n     P
  …
  xm1    xm2    xm3    xm4    …  xmn     N

Features
• Also called attributes
• Categorical features
  – e.g., feature color = {red, blue, green}
• Continuous or numerical features
  – gene expression
  – age
  – blood pressure
• Discretization turns numerical features into categorical ones

An Example
(figure)

Overall Picture of Supervised Learning
(diagram) Learning methods (decision trees, emerging patterns, SVM, neural networks) trained on biomedical, financial, government, and scientific data yield classifiers ("M-Doctors")

Evaluation of a Classifier
• Performance on independent blind test data
• K-fold cross validation: given a dataset, divide it into k even parts; k-1 of them are used for training and the remaining part is treated as test data; each part serves as test data once (a minimal sketch follows below)
• LOOCV (leave-one-out cross validation) is the special case of k-fold CV where k equals the number of instances
• Accuracy, error rate
• False positive rate, false negative rate, sensitivity, specificity, precision
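The k-fold procedure above is easy to realize in a few lines. The following is a minimal sketch, not part of the original slides: it assumes scikit-learn is available (its CART-style DecisionTreeClassifier stands in for a decision tree inducer) and uses synthetic placeholder data.

```python
# A minimal k-fold cross-validation sketch (an illustration, with assumptions
# noted above; the data below is a random placeholder, not a real dataset).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def k_fold_cv(X, y, k=5, seed=0):
    """Estimate accuracy by k-fold cross validation; k = len(y) gives LOOCV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))         # shuffle the instance indices once
    folds = np.array_split(idx, k)        # k (nearly) even parts
    correct = 0
    for i in range(k):
        test = folds[i]                   # one part held out as test data
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        clf = DecisionTreeClassifier().fit(X[train], y[train])
        correct += np.sum(clf.predict(X[test]) == y[test])
    return correct / len(y)               # overall cross-validated accuracy

# Example with placeholder data: 100 samples, 20 features, 2 classes.
X = np.random.rand(100, 20)
y = np.random.randint(0, 2, size=100)
print("CV accuracy:", k_fold_cv(X, y, k=5))
```

Sensitivity, specificity, and the other rates listed above would be computed the same way, by pooling the held-out predictions across the k folds and counting true/false positives and negatives.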
Requirements of Biomedical Classification
• High accuracy
• High comprehensibility

Importance of Rule-Based Methods
• Systematic selection of a small number of features for decision making, which increases the comprehensibility of the knowledge patterns
• C4.5 and CART are two commonly used rule induction algorithms, also called decision tree induction algorithms

Structure of Decision Trees
(diagram) A root node (testing, e.g., x1 > a1), internal nodes (testing x2, x3, x4), and leaf nodes labeled with the classes A and B
• E.g., if x1 > a1 & x2 > a2, then it is of class A
• C4.5 and CART are two of the most widely used
• Easy interpretation, but accuracy is generally unattractive

Elegance of Decision Trees
(figure: the tree's rules carve the feature space into axis-parallel rectangular regions, each labeled A or B)

A Brief History of Decision Trees
• CLS (Hunt et al., 1966) --- cost driven
• CART (Breiman et al., 1984) --- Gini index
• ID3 (Quinlan, 1986, MLJ) --- information driven
• C4.5 (Quinlan, 1993) --- gain ratio + pruning ideas

A Simple Dataset
9 Play samples and 5 Don't, a total of 14 (the full table is given below).

A Decision Tree
(diagram)
  outlook = sunny: humidity <= 75 → Play (2); humidity > 75 → Don't (3)
  outlook = overcast: → Play (4)
  outlook = rain: windy = false → Play (3); windy = true → Don't (2)
• Finding a smallest decision tree consistent with the training data is an NP-complete problem

Construction of a Decision Tree
• Determination of the root node of the tree and of the root nodes of its sub-trees

Most Discriminatory Feature
• Every feature can be used to partition the training data
• If the partitions contain a pure class of training instances, then this feature is most discriminatory

Example of Partitions
• Categorical feature
  – the number of partitions of the training data equals the number of values of this feature
• Numerical feature
  – two partitions (below and above a split point)

Instance #  Outlook    Temp  Humidity  Windy  Class
 1          Sunny       75     70      true   Play
 2          Sunny       80     90      true   Don't
 3          Sunny       85     85      false  Don't
 4          Sunny       72     95      true   Don't
 5          Sunny       69     70      false  Play
 6          Overcast    72     90      true   Play
 7          Overcast    83     78      false  Play
 8          Overcast    64     65      true   Play
 9          Overcast    81     75      false  Play
10          Rain        71     80      true   Don't
11          Rain        65     70      true   Don't
12          Rain        75     80      false  Play
13          Rain        68     80      false  Play
14          Rain        70     96      false  Play

Partition by Outlook (total 14 training instances)
• Outlook = sunny: instances 1, 2, 3, 4, 5 → P, D, D, D, P
• Outlook = overcast: instances 6, 7, 8, 9 → P, P, P, P
• Outlook = rain: instances 10, 11, 12, 13, 14 → D, D, P, P, P

Partition by Temperature (total 14 training instances)
• Temperature <= 70: instances 5, 8, 11, 13, 14 → P, P, D, P, P
• Temperature > 70: instances 1, 2, 3, 4, 6, 7, 9, 10, 12 → P, D, D, D, P, P, P, D, P

Three Measures
• Gini index
• Information gain
• Gain ratio
(A worked computation of these measures on the Outlook partition is sketched below.)
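To make the three measures concrete, here is a small worked sketch (an illustration, not the authors' code) that computes them for the Outlook partition; the class counts are taken directly from the 14-instance table above.

```python
# Computing the Gini index, information gain, and gain ratio
# for the Outlook split of the 14-instance weather data.
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index of a class distribution given as counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Class counts (Play, Don't) before and after splitting on Outlook.
parent = [9, 5]
children = [[2, 3],   # sunny:    2 Play, 3 Don't
            [4, 0],   # overcast: 4 Play, 0 Don't
            [3, 2]]   # rain:     3 Play, 2 Don't

n = sum(parent)
weights = [sum(c) / n for c in children]   # fraction of data in each branch

info_gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
split_info = -sum(w * log2(w) for w in weights)  # penalizes many-valued splits
gain_ratio = info_gain / split_info              # C4.5's criterion

print(f"Gini (parent)    = {gini(parent):.3f}")  # about 0.459
print(f"Information gain = {info_gain:.3f}")     # about 0.247
print(f"Gain ratio       = {gain_ratio:.3f}")    # about 0.156
```

Running the same computation for every candidate feature (and every split point of a numerical feature) and picking the best score is exactly the root-node selection step described next.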
Steps of Decision Tree Construction
• Select the best feature as the root node of the whole tree
• After the partition by this feature, select the best feature (wrt the subset of training data) as the root node of each sub-tree
• Recurse until the partitions become pure or almost pure

Characteristics of C4.5 Trees
• Single coverage of training data (elegance)
• Divide-and-conquer splitting strategy
• Fragmentation problem
• Locally reliable but globally insignificant rules
• Misses many globally significant rules, which can mislead the system

Decision Tree Ensembles
• Bagging
• Boosting
• Random forest
• Randomization trees
• CS4

Motivating Example
• h1, h2, h3 are independent classifiers with accuracy = 60%
• C1, C2 are the only classes
• t is a test instance in C1
• h(t) = argmax_{C ∈ {C1, C2}} |{hj ∈ {h1, h2, h3} : hj(t) = C}|
• Then
  prob(h(t) = C1) = prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C1)
                  + prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C2)
                  + prob(h1(t)=C1 & h2(t)=C2 & h3(t)=C1)
                  + prob(h1(t)=C2 & h2(t)=C1 & h3(t)=C1)
                  = 60% * 60% * 60% + 3 * (60% * 60% * 40%)
                  = 64.8%
• So the majority vote of three mediocre but independent classifiers is more accurate than any single one

Bagging
• Proposed by Breiman (1996)
• Also called bootstrap aggregating
• Makes use of randomness injected into the training data

Main Ideas
(diagram) From the original training set (48 p + 52 n), bootstrap samples are drawn with replacement (50 p + 50 n, 49 p + 51 n, …, 53 p + 47 n); a base inducer such as C4.5 is trained on each sample, giving a committee H of classifiers h1, h2, …, hk

Decision Making by Bagging
Given a new test sample T, every committee member classifies T and the majority class is predicted (equal voting)

Boosting
• AdaBoost by Freund & Schapire (1995)
• Also called adaptive boosting
• Makes use of weighted instances and weighted voting

Main Ideas
(diagram) Start with 100 instances of equal weight and train a classifier h1 with error e1; if e1 is 0 or > 0.5, stop; otherwise reweight the correctly classified instances by e1/(1-e1), renormalize the weights to 1, and train the next classifier h2 on the 100 instances with their new weights; repeat

Decision Making by AdaBoost.M1
Given a new test sample T, the committee members cast weighted votes (in AdaBoost.M1, a classifier with error e votes with weight log((1-e)/e))

Bagging vs Boosting
• Bagging
  – construction of the bagging classifiers is independent
  – equal voting
• Boosting
  – construction of a new boosting classifier depends on the performance of its previous classifier, i.e., sequential construction (a series of classifiers)
  – weighted voting

Random Forest
• Proposed by Breiman (2001)
• Similar to bagging, but the base inducer is not the standard C4.5
• Makes use of randomness twice: once in the bootstrap sampling and once inside the base inducer

Main Ideas
(diagram) As in bagging, bootstrap samples (50 p + 50 n, 49 p + 51 n, …, 53 p + 47 n) are drawn from the original training set (48 p + 52 n), but the base inducer is a revised C4.5; the result is a committee H of classifiers h1, h2, …, hk

A Revised C4.5 as Base Classifier
(diagram) At the root node (and at every split), the split is selected not from the original n features but from mtry randomly chosen features

Decision Making by Random Forest
Given a new test sample T, the committee votes as in bagging (equal voting)

Randomization Trees
• Proposed by Dietterich (2000)
• Makes use of randomness in the selection of the best split point

Main Ideas
(diagram) At the root node, with the original n features, one split is selected at random from the best candidate splits {feature 1: choices 1, 2, 3; feature 2: choices 1, 2, …; feature 8: choices 1, 2, 3}, a total of 20 candidates; equal voting on the committee of such decision trees

CS4
• Proposed by Li et al. (2003)
• CS4: Cascading and Sharing for decision trees
• Does not make use of randomness

Main Ideas
(diagram) Total k trees: the i-th ranked root feature becomes the root node of tree-i (tree-1, tree-2, …, tree-k); the selection of root nodes is done in a cascading manner!

Decision Making by CS4
Given a new test sample T, decision making uses unequal voting

Summary of Ensemble Classifiers
• Bagging, Random Forest, AdaBoost.M1, Randomization Trees: rules may not be correct when applied to the training data
• CS4: rules correct
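To tie the ensemble ideas together, the following is a minimal bagging sketch in Python. It is an illustration under stated assumptions, not any of the original systems: scikit-learn's CART-style DecisionTreeClassifier stands in for C4.5 as the base inducer, the data is a synthetic placeholder, and decision making is by equal (majority) voting as described above.

```python
# A minimal bagging committee (an illustrative sketch, not Breiman's code).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Train a committee of k trees, each on a bootstrap sample of the data."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(k):
        boot = rng.integers(0, len(y), size=len(y))  # sample with replacement
        committee.append(DecisionTreeClassifier().fit(X[boot], y[boot]))
    return committee

def bagging_predict(committee, X):
    """Equal voting: each tree votes; the majority class wins."""
    votes = np.stack([h.predict(X) for h in committee])   # shape (k, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, votes)  # majority per sample

# Example with placeholder data: 200 samples, 10 features, 2 classes.
X = np.random.rand(200, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
H = bagging_fit(X, y, k=25)
print("Training accuracy:", np.mean(bagging_predict(H, X) == y))
```

Restricting each tree's splits to a random subset of features (e.g., via scikit-learn's max_features parameter) would turn this same committee into a random-forest-style ensemble, which is the "randomness used twice" described above.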