CENG 464 Introduction to Data Mining

Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification: Definition (example training set)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Prediction Problems: Classification vs. Numeric Prediction
• Classification:
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
– Credit/loan approval
– Medical diagnosis: is a tumor cancerous or benign?
– Fraud detection: is a transaction fraudulent?
– Web page categorization: which category does a page belong to?

Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
– Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is independent of the training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
• Note: if the test set is used to select models, it is called a validation (test) set

Process (1): Model Construction
Training data are fed to a classification algorithm, which produces a classifier (model).
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Process (2): Using the Model in Prediction
The classifier is applied to testing data, and then to unseen data, e.g. (Jeff, Professor, 4) → Tenured?
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
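To make the two-step process concrete, the following minimal Python sketch (illustrative only, not part of the original slides; the scikit-learn library and the numeric encoding of rank are assumptions) builds a classifier from the labelled training records above, estimates its accuracy on the held-out test records, and then labels an unseen record such as (Jeff, Professor, 4).

# Minimal sketch of the two-step classification process (illustrative only).
from sklearn.tree import DecisionTreeClassifier

# Step 1: model construction from the labelled training set (rank encoded as a number).
rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}
X_train = [[rank_code["Assistant Prof"], 3], [rank_code["Assistant Prof"], 7],
           [rank_code["Professor"], 2],      [rank_code["Associate Prof"], 7],
           [rank_code["Assistant Prof"], 6], [rank_code["Associate Prof"], 3]]
y_train = ["no", "yes", "yes", "yes", "no", "no"]
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent test set ...
X_test = [[rank_code["Assistant Prof"], 2], [rank_code["Associate Prof"], 7],
          [rank_code["Professor"], 5],      [rank_code["Assistant Prof"], 7]]
y_test = ["no", "no", "yes", "yes"]
print("test accuracy:", model.score(X_test, y_test))

# ... and, if the accuracy is acceptable, classify unseen data, e.g. (Jeff, Professor, 4).
print("Jeff tenured?", model.predict([[rank_code["Professor"], 4]])[0])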
Illustrating Classification Task
Training and test sets are randomly sampled from the given data. A supervised learning algorithm induces a model from the training set (induction); the model is then applied to the test set to estimate accuracy (deduction).
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Goal: find a mapping or function that can predict the class label of a given tuple X.

Classification Techniques
• Decision Tree based Methods
• Bayes Classification Methods
• Rule-based Methods
• Nearest-Neighbor Classifiers
• Artificial Neural Networks
• Support Vector Machines
• Memory based reasoning

Example of a Decision Tree
The Refund / Marital Status / Taxable Income / Cheat training data above can be modeled by the following tree:
Refund?
– Yes → NO
– No → MarSt?
  – Single, Divorced → TaxInc?
    – < 80K → NO
    – > 80K → YES
  – Married → NO
The root node and internal nodes hold attribute test conditions (the splitting attributes); leaf nodes hold class labels.

Another Example of a Decision Tree
MarSt?
– Married → NO
– Single, Divorced → Refund?
  – Yes → NO
  – No → TaxInc?
    – < 80K → NO
    – > 80K → YES
There could be more than one tree that fits the same data!

Decision Tree Classification Task
The same induction/deduction framework as above, with a tree induction algorithm as the learner: learn the model (tree) from the training set, then apply the model to the test set.

Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branches that match the record: Refund = No → MarSt = Married → leaf NO. Assign Cheat to “No”.

Decision Tree Induction
• Many algorithms:
– Hunt’s Algorithm
– ID3, C4.5
– CART
– SLIQ, SPRINT

Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
– There are no samples left

Tree Induction
• Greedy strategy
– Split the records based on an attribute test that optimizes a certain criterion
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting

How to Specify the Test Condition?
• Depends on attribute type
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split
– Multi-way split

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Size → {Small}, {Medium}, {Large}
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. Size → {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}
• What about the split {Small, Large} vs. {Medium}? It does not respect the order of the values.

Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
– Binary decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute intensive
• Example: (i) binary split: Taxable Income > 80K? Yes/No; (ii) multi-way split: Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

How to Determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1. Candidate test conditions:
• Own Car? (Yes: C0 6 / C1 4; No: C0 4 / C1 6)
• Car Type? (Family: C0 1 / C1 3; Sports: C0 8 / C1 0; Luxury: C0 1 / C1 7)
• Student ID? (c1 … c10: C0 1 / C1 0 each; c11 … c20: C0 0 / C1 1 each)
Which test condition is the best?
How to Determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
– C0: 5, C1: 5 → non-homogeneous, high degree of impurity
– C0: 9, C1: 1 → homogeneous, low degree of impurity

Attribute Selection / Splitting Rules Measures (Measures of Node Impurity)
Provide a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
• Information Gain (Entropy)
• Gini Index
• Misclassification error

Brief Review of Entropy
Entropy measures the uncertainty of a class distribution; for m = 2 classes it is maximal (1 bit) when the two classes are equally likely and 0 when all tuples belong to one class.

Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or impurity in these partitions
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = −Σ_{i=1..m} pi log2(pi)
• Information needed (after using A to split D into v partitions) to classify D:
  InfoA(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)
• Information gained by branching on attribute A:
  Gain(A) = Info(D) − InfoA(D)

Attribute Selection: Information Gain — Example
Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
age          pi  ni  I(pi, ni)
youth        2   3   0.971
middle-aged  4   0   0
senior       3   2   0.971
Training data:
age          income  student  credit_rating  buys_computer
youth        high    no       fair           no
youth        high    no       excellent      no
middle-aged  high    no       fair           yes
senior       medium  no       fair           yes
senior       low     yes      fair           yes
senior       low     yes      excellent      no
middle-aged  low     yes      excellent      yes
youth        medium  no       fair           no
youth        low     yes      fair           yes
senior       medium  yes      fair           yes
youth        medium  yes      excellent      yes
middle-aged  medium  no       excellent      yes
middle-aged  high    yes      fair           yes
senior       medium  no       excellent      no
Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
where (5/14) I(2,3) means “youth” has 5 out of 14 samples, with 2 yes’es and 3 no’s.
Hence Gain(age) = Info(D) − Infoage(D) = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
– Sort the values of A in increasing order
– Typically, the midpoint between each pair of adjacent values is considered as a possible split point
• (ai + ai+1)/2 is the midpoint between the values ai and ai+1
– The point with the minimum expected information requirement for A is selected as the split point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
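As a quick check of the worked example above, this short Python sketch (illustrative, not from the slides) computes Info(D), Infoage(D) and Gain(age) directly from the class counts of the buys_computer data.

from math import log2

def info(*counts):
    """Entropy I(c1, c2, ...) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# buys_computer data: 9 "yes" and 5 "no" tuples overall.
info_D = info(9, 5)                                              # ≈ 0.940

# Partition induced by age: youth (2 yes, 3 no), middle-aged (4, 0), senior (3, 2).
partitions = [(2, 3), (4, 0), (3, 2)]
info_age = sum((sum(p) / 14) * info(*p) for p in partitions)     # ≈ 0.694

gain_age = info_D - info_age   # ≈ 0.247 (0.246 on the slide, which subtracts the rounded values)
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))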
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):
  SplitInfoA(D) = −Σ_{j=1..v} (|Dj|/|D|) log2(|Dj|/|D|)
– GainRatio(A) = Gain(A) / SplitInfoA(D)
• Ex. income splits D into 4 “low”, 6 “medium” and 4 “high” tuples:
– SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
– gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute

Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 − Σ_{j=1..n} pj²
  where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D) is defined as
  giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
• Reduction in impurity:
  Δgini(A) = gini(D) − giniA(D)
• The attribute that provides the smallest ginisplit(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

Computation of Gini Index
• Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:
  gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
  giniincome∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) ≈ 0.443
  Gini{low,high} is 0.458 and Gini{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes

Comparing Attribute Selection Measures
• The three measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is much smaller than the others
– Gini index:
• biased towards multivalued attributes
• has difficulty when the number of classes is large
• tends to favor tests that result in equal-sized partitions and purity in both partitions

Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on the χ² test for independence
• C-SEP: performs better than information gain and gini index in certain cases
• G-statistic: has a close approximation to the χ² distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
– The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
– CART: finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best?
– Most give good results; none is significantly superior to the others

Overfitting and Tree Pruning
• Overfitting: an induced tree may overfit the training data
– Too many branches, some of which may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: halt tree construction early – do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: remove branches from a “fully grown” tree – get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is the “best pruned tree”

Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets
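Before moving on, here is a short illustrative Python check of the Gini index computation above (not part of the slides); the class counts inside the income groups (7 yes / 3 no in {low, medium}, 2 yes / 2 no in {high}) are taken from the buys_computer training table.

def gini(*counts):
    """Gini index 1 - sum(p_j^2) of a class-count distribution."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Whole data set D: 9 buys_computer = "yes", 5 = "no".
gini_D = gini(9, 5)                                              # ≈ 0.459

# Binary split on income: D1 = {low, medium} (7 yes, 3 no), D2 = {high} (2 yes, 2 no).
gini_income = (10 / 14) * gini(7, 3) + (4 / 14) * gini(2, 2)     # ≈ 0.443

print(round(gini_D, 3), round(gini_income, 3))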
Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes’ Theorem
• Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
• Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes’ Theorem: Basics
• Total probability theorem:
  P(B) = Σ_{i=1..M} P(B|Ai) P(Ai)
• Bayes’ theorem:
  P(H|X) = P(X|H) P(H) / P(X)
– Let X be a data sample (“evidence”): class label is unknown
– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X) (i.e., the posteriori probability): the probability that the hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy a computer, regardless of age, income, …
– P(X): probability that the sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., given that X will buy a computer, the probability that X is 31..40 with medium income

Prediction Based on Bayes’ Theorem
• Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
  P(H|X) = P(X|H) P(H) / P(X)
• Informally, this can be viewed as posteriori = likelihood × prior / evidence
• Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
• Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost

Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes’ theorem:
  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Naïve Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
  P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: only the class distribution needs to be counted
• If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak divided by |Ci,D| (the number of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))
  and P(xk|Ci) = g(xk, μCi, σCi)

Naïve Bayes Classifier: Training Dataset
Class: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’
Data to be classified: X = (age = youth, income = medium, student = yes, credit_rating = fair)
age          income  student  credit_rating  buys_computer
youth        high    no       fair           no
youth        high    no       excellent      no
middle-aged  high    no       fair           yes
senior       medium  no       fair           yes
senior       low     yes      fair           yes
senior       low     yes      excellent      no
middle-aged  low     yes      excellent      yes
youth        medium  no       fair           no
youth        low     yes      fair           yes
senior       medium  yes      fair           yes
youth        medium  yes      excellent      yes
middle-aged  medium  no       excellent      yes
middle-aged  high    yes      fair           yes
senior       medium  no       excellent      no

Naïve Bayes Classifier: An Example (using the training data above)
• P(Ci):
  P(buys_computer = “yes”) = 9/14 = 0.643
  P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “youth” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
  P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci) × P(Ci):
  P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
  P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.007
  Therefore, X belongs to class (“buys_computer = yes”)

Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
  P(X|Ci) = Π_{k=1..n} P(xk|Ci)
• Ex. Suppose a data set with 1000 tuples has income = low (0 tuples), income = medium (990) and income = high (10)
• Use the Laplacian correction (or Laplacian estimator)
– Add 1 to each case:
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
– The “corrected” probability estimates are close to their “uncorrected” counterparts
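The short Python sketch below (illustrative, not from the slides) reproduces the naïve Bayes calculation above from simple value counts over the 14 training tuples; the Laplacian correction from the previous slide could be added by putting +1 in the numerator counts.

from collections import Counter, defaultdict

# The 14 training tuples: (age, income, student, credit_rating) -> buys_computer.
data = [
    (("youth", "high", "no", "fair"), "no"),
    (("youth", "high", "no", "excellent"), "no"),
    (("middle-aged", "high", "no", "fair"), "yes"),
    (("senior", "medium", "no", "fair"), "yes"),
    (("senior", "low", "yes", "fair"), "yes"),
    (("senior", "low", "yes", "excellent"), "no"),
    (("middle-aged", "low", "yes", "excellent"), "yes"),
    (("youth", "medium", "no", "fair"), "no"),
    (("youth", "low", "yes", "fair"), "yes"),
    (("senior", "medium", "yes", "fair"), "yes"),
    (("youth", "medium", "yes", "excellent"), "yes"),
    (("middle-aged", "medium", "no", "excellent"), "yes"),
    (("middle-aged", "high", "yes", "fair"), "yes"),
    (("senior", "medium", "no", "excellent"), "no"),
]

class_counts = Counter(label for _, label in data)   # {'yes': 9, 'no': 5}
value_counts = defaultdict(Counter)                   # (class, attribute index) -> value counts
for x, label in data:
    for k, v in enumerate(x):
        value_counts[(label, k)][v] += 1

def score(x, label):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption."""
    p = class_counts[label] / len(data)
    for k, v in enumerate(x):
        p *= value_counts[(label, k)][v] / class_counts[label]
    return p

X = ("youth", "medium", "yes", "fair")
for label in class_counts:
    print(label, round(score(X, label), 3))   # yes ≈ 0.028, no ≈ 0.007 -> predict "yes"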
Naïve Bayes Classifier: Comments
• Advantages
– Easy to implement
– Robust to noise
– Can handle null values
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
• E.g., hospital patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by the Naïve Bayes Classifier
• How to deal with these dependencies? Bayesian Belief Networks

Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
– If a rule is satisfied by X, it covers the tuple and the rule is said to be triggered
– If R is the rule satisfied, then the rule fires by returning the class prediction
• Assessment of a rule: coverage and accuracy
– ncovers = number of tuples covered by R
– ncorrect = number of tuples correctly classified by R
  coverage(R) = ncovers / |D|      accuracy(R) = ncorrect / ncovers
  where D is the training data set

Using IF-THEN Rules for Classification
• If more than one rule is triggered, we need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
– Rule ordering: prioritize rules beforehand
• Class-based ordering: classes are sorted in order of decreasing importance, such as order of prevalence or misclassification cost per class. Within each class, the rules are not ordered
• Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality such as accuracy, coverage or size. The first rule satisfying X fires the class prediction; any other rule satisfying X is ignored. Each rule in the list implies the negation of the rules that come before it → difficult to interpret
• What if no rule is fired for X? Use a default rule!

Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf; each attribute-value pair along a path forms a conjunction, and the conjunctions are logically ANDed to form the rule antecedent; the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
– Mutually exclusive: no two rules will be triggered for the same tuple
– Exhaustive: there is one rule for each possible attribute-value combination → no need for a default rule

Rule Extraction from a Decision Tree — Example
Tree: age? youth → student? (no → no, yes → yes); middle_aged → yes; senior → credit_rating? (excellent → no, fair → yes)
Rules extracted from the buys_computer decision tree:
IF age = youth AND student = no THEN buys_computer = no
IF age = youth AND student = yes THEN buys_computer = yes
IF age = middle_aged THEN buys_computer = yes
IF age = senior AND credit_rating = excellent THEN buys_computer = no
IF age = senior AND credit_rating = fair THEN buys_computer = yes
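The following small Python sketch (illustrative only; the dict-based rule representation is an assumption, not the notation used in the slides) shows how the coverage and accuracy measures defined above can be evaluated for an IF-THEN rule over a data set.

# Evaluating coverage(R) and accuracy(R) of an IF-THEN rule (illustrative sketch).
def covers(rule, tuple_):
    return all(tuple_[attr] == value for attr, value in rule["if"].items())

def coverage_and_accuracy(rule, data):
    covered = [t for t in data if covers(rule, t)]
    correct = [t for t in covered if t["class"] == rule["then"]]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

rule = {"if": {"age": "youth", "student": "yes"}, "then": "yes"}
data = [
    {"age": "youth",  "student": "yes", "class": "yes"},
    {"age": "youth",  "student": "no",  "class": "no"},
    {"age": "senior", "student": "yes", "class": "yes"},
]
print(coverage_and_accuracy(rule, data))   # (0.33..., 1.0)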
Rule Induction: Sequential Covering Method
• Sequential covering algorithm: extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rule are removed
– Repeat the process on the remaining tuples until a termination condition is met, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
• Compare with decision-tree induction, which learns a set of rules simultaneously

Sequential Covering Algorithm
When learning a rule for a class C, we would like the rule to cover all or most of the training tuples of class C and none or few of the tuples from other classes.
  while (enough target tuples left)
      generate a rule
      remove positive target tuples satisfying this rule
Each learned rule covers a portion of the positive examples (Rule 1, Rule 2, Rule 3, …).

How to Learn One Rule?
Two approaches. Specialization:
• Start with the most general rule possible: empty rule → class y
• The best attribute-value pair from a list A is added to the antecedent
• Continue until the rule performance measure cannot improve further
– IF income = high THEN loan_decision = accept
– IF income = high AND credit_rating = excellent THEN loan_decision = accept
– Greedy algorithm: always add the attribute-value pair which is best at the moment

How to Learn One Rule?
Two approaches. Generalization:
• Start with a randomly selected positive tuple and convert it to a rule that covers it
– The tuple (overcast, high, false, P) can be converted to the rule: IF outlook = overcast AND humidity = high AND windy = false THEN class = P
• Choose one attribute-value pair and remove it so that the rule covers more positive examples
• Repeat the process until the rule starts to cover negative examples

How to Learn One Rule?
Rule-quality measures: used to decide if appending a test to the rule’s condition will result in an improved rule: accuracy, coverage
• Consider R1, which correctly classifies 38 of the 40 tuples it covers, whereas R2 covers only 2 tuples and correctly classifies all of them: which rule is better? By accuracy alone R2 looks better, even though it covers only 2 tuples
• Different measures: FOIL-gain, likelihood ratio statistic, chi-square statistic

How to Learn One Rule?
Rule-quality measures: FOIL-gain checks whether ANDing a new condition to a rule R (giving R′) results in a better rule
• It considers both coverage and accuracy
– FOIL-gain (in FOIL & RIPPER) assesses the information gained by extending the condition:
  FOIL_Gain = pos′ × ( log2( pos′ / (pos′ + neg′) ) − log2( pos / (pos + neg) ) )
  where pos and neg are the numbers of positive and negative tuples covered by R, and pos′ and neg′ are the numbers of positive and negative tuples covered by R′
• It favors rules that have high accuracy and cover many positive tuples
• There is no test set for evaluating rules, but rule pruning is performed by removing a condition:
  FOIL_Prune(R) = (pos − neg) / (pos + neg)
  where pos/neg are the numbers of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R

Nearest Neighbour Approach
• General Idea
– The Model: a set of training examples stored in memory
– Lazy learning: delaying the decision to the time of classification. In other words, there is no training!
– To classify an unseen record: compute its proximity to all training examples and locate the 1 or k nearest neighbour examples. The nearest neighbours determine the class of the record (e.g., majority vote)
– Rationale: “If it walks like a duck, quacks like a duck, and looks like a duck, it probably is a duck.”
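A minimal Python sketch of this lazy nearest-neighbour idea is shown below (illustrative only; the squared-Euclidean distance and the toy two-feature data set are assumptions). It follows the same steps as the generic kNN pseudocode on the next slide: compute the proximity of the query to every stored example, take the k nearest, and return the majority class.

from collections import Counter

def knn_classify(train, query, k=3,
                 distance=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))):
    # proximity of the query to every stored training example
    neighbours = sorted(train, key=lambda example: distance(example[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "duck"), ((1.2, 0.9), "duck"),
         ((5.0, 5.5), "goose"), ((5.2, 5.0), "goose")]
print(knn_classify(train, (1.1, 1.0), k=3))   # 'duck'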
Nearest Neighbour Approach
• kNN Classification Algorithm
  algorithm kNN (Tr: training set; k: integer; r: data record): Class
  begin
      for each training example t in Tr do
          calculate proximity d(t, r) upon descriptive attributes
      end for;
      select the top k nearest neighbours into set D accordingly;
      Class := majority class in D;
      return Class
  end;

Nearest Neighbour Approach
• PEBLS Algorithm
– A class-based similarity measure is used
– A nearest neighbour algorithm (k = 1)
– Examples in memory have weights (exemplars)
– Simple training: assigning and refining weights
– A different proximity measure
– Algorithm outline:
  1. Build value difference tables for the descriptive attributes (in preparation for measuring distances between examples)
  2. For each training example, refine the weight of its nearest neighbour
  3. Refine the weights of some training examples when classifying validation examples

Nearest Neighbour Approach
• PEBLS: Value Difference Table
A value difference table for an attribute A stores the distance d(Ai, Aj) between every pair of values Ai, Aj of A, where
  d(V1, V2) = Σ_{i=1..k} | Ci,V1 / CV1 − Ci,V2 / CV2 |^r
with r set to 1, and
  CV1: total number of examples with value V1
  CV2: total number of examples with value V2
  Ci,V1: total number of examples with value V1 and of class i
  Ci,V2: total number of examples with value V2 and of class i
Weather data:
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      FALSE  N
sunny     hot          high      TRUE   N
overcast  hot          high      FALSE  P
rain      mild         high      FALSE  P
rain      cool         normal    FALSE  P
rain      cool         normal    TRUE   N
overcast  cool         normal    TRUE   P
sunny     mild         high      FALSE  N
sunny     cool         normal    FALSE  P
rain      mild         normal    FALSE  P
sunny     mild         normal    TRUE   P
overcast  mild         high      TRUE   P
overcast  hot          normal    FALSE  P
rain      mild         high      TRUE   N
Value difference table for Outlook:
          sunny  overcast  rain
sunny     0      1.2       0.4
overcast  1.2    0         0.8
rain      0.4    0.8       0
For example, d(sunny, overcast) = |2/5 − 4/4| + |3/5 − 0/4| = 1.2 (sunny: 2 P and 3 N out of 5; overcast: 4 P and 0 N out of 4).

Nearest Neighbour Approach
• PEBLS: Distance Function
  Δ(X, Y) = wX · wY · Σ_{i=1..m} d(xi, yi)²
  where wX, wY: weights for X and Y; m: the number of attributes; xi, yi: values of the ith attribute for X and Y.
  wX = T / C
  where T: the total number of times that X is selected as the nearest neighbour, and C: the total number of times that X correctly classifies examples.
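As an illustrative check (not from the slides), the sketch below computes the PEBLS value difference table for the Outlook attribute of the weather data with r = 1; its output matches the table shown above.

from collections import Counter, defaultdict

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
labels  = ["N", "N", "P", "P", "P", "N", "P", "N", "P", "P", "P", "P", "P", "N"]

totals = Counter(outlook)                 # C_v: examples per value
per_class = defaultdict(Counter)          # C_iv: examples per value and class
for v, c in zip(outlook, labels):
    per_class[v][c] += 1

def value_difference(v1, v2, classes=("P", "N"), r=1):
    return sum(abs(per_class[v1][c] / totals[v1] - per_class[v2][c] / totals[v2]) ** r
               for c in classes)

for v1 in ("sunny", "overcast", "rain"):
    print([round(value_difference(v1, v2), 1) for v2 in ("sunny", "overcast", "rain")])
# [0.0, 1.2, 0.4]
# [1.2, 0.0, 0.8]
# [0.4, 0.8, 0.0]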
Nearest Neighbour Approach
• PEBLS: Distance Function (Example)
Value difference tables built from the weather data (rows 1–14 above):
Outlook:      d(sunny, overcast) = 1.2, d(sunny, rain) = 0.4, d(overcast, rain) = 0.8
Temperature:  d(hot, mild) = 0.33, d(hot, cool) = 0.5, d(mild, cool) = 0.33
Humidity:     d(high, normal) = 0.857
Windy:        d(TRUE, FALSE) = 0.418
Assuming row1.weight = row2.weight = 1,
Δ(row1, row2) = d(row1outlook, row2outlook)² + d(row1temperature, row2temperature)² + d(row1humidity, row2humidity)² + d(row1windy, row2windy)²
              = d(sunny, sunny)² + d(hot, hot)² + d(high, high)² + d(false, true)²
              = 0 + 0 + 0 + (1/2)² = 1/4

Nearest Neighbour Approach
• PEBLS: Example
[Figure: the 14 weather training examples stored with their exemplar weights (values of 1, 1.5 and 2 in the original slide), shown both while classifying an unseen record (overcast, hot, high, false, class = ?) and after training; reliable exemplars keep weight 1, while exemplars that are selected as nearest neighbours but classify incorrectly receive larger weights (w = T/C).]

Artificial Neural Network Approach
– Our brains are made up of about 100 billion tiny units called neurons.
– Each neuron is connected to thousands of other neurons and communicates with them via electrochemical signals.
– Signals coming into the neuron are received via junctions called synapses; these in turn are located at the ends of branches of the neuron cell called dendrites.
– The neuron continuously receives signals from these inputs.
– What the neuron does is sum up the inputs to itself in some way and then, if the end result is greater than some threshold value, the neuron fires.
– It generates a voltage and outputs a signal along something called an axon.
Artificial Neural Network Approach
• General Idea
– The Model: a network of connected artificial neurons
– Training: select a specific network topology and use the training examples to tune the weights attached to the links connecting the neurons
– To classify an unseen record X, feed the descriptive attribute values of the record into the network as inputs. The network computes an output value that can be converted to a class label

Artificial Neural Network Approach
• Artificial Neuron (Unit)
Inputs i1, i2, i3 arrive over links with weights w1, w2, w3; the unit sums them and passes the result through a transformation function to produce the output y.
  Sum function: x = w1·i1 + w2·i2 + w3·i3
  Transformation function: Sigmoid(x) = 1 / (1 + e^(−x))

Artificial Neural Network Approach
• A neural network can have many hidden layers, but one layer is normally considered sufficient
• The more units a hidden layer has, the more capacity for pattern recognition
• Constant inputs can be fed into the units in the hidden and output layers as inputs
• A network with links only from lower layers to upper layers is a feed-forward network
• A network with links between nodes of the same layer is a recurrent network

Artificial Neural Network Approach
• Artificial Neuron (Perceptron)
The same unit structure is used as a perceptron: the weighted sum x = w1·i1 + w2·i2 + w3·i3 followed by the sigmoid transformation Sigmoid(x) = 1 / (1 + e^(−x)).

Artificial Neural Network Approach
• General Principle for Training an ANN
  algorithm trainNetwork (Tr: training set): Network
  begin
      R := initial network with a particular topology;
      initialise the weight vector with random values w(0);
      repeat
          for each training example t = <xi, yi> in Tr do
              compute the predicted class output ŷ(k);
              for each weight wj in the weight vector do
                  update the weight wj: wj(k+1) := wj(k) + λ (yi − ŷ(k)) xij
              end for;
          end for;
      until stopping criterion is met;
      return R
  end;
  λ: the learning factor. The larger its value, the bigger the amount by which the weights change.

Artificial Neural Network Approach
• Using ANN for Classification
– Multiple hidden layers:
• We do not know the actual class value at the hidden units, hence it is difficult to adjust their weights
• Solution: back-propagation (layer by layer, from the output layer)
– Model overfitting: use validation examples to further tune the weights in the network
– Descriptive attributes should be normalized or converted to binary
– Training examples are used repeatedly; the training cost is therefore very high
– Difficulty in explaining classification decisions

Artificial Neural Network Approach
• Network Topology
– Number of nodes in the input layer: determined by the number and data types of the attributes:
• Continuous and binary attributes: 1 node for each attribute
• Categorical attributes: convert to numeric or binary – an attribute with k labels needs at least log k nodes
– Number of nodes in the output layer: determined by the number of classes
• For a 2-class problem, 1 node
• For a k-class problem, at least log k nodes
– Number of hidden layers and nodes in the hidden layers: difficult to decide
– In networks with hidden layers, the weights are updated using back-propagation
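The sketch below is an illustrative Python rendering of a single sigmoid unit trained with the weight-update rule from the trainNetwork pseudocode above; the learning factor, the random initial weights and the tiny AND-like data set are assumptions made for the example.

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, inputs):
    # sum function followed by the sigmoid transformation function
    return sigmoid(sum(w * i for w, i in zip(weights, inputs)))

def train(examples, n_inputs, rate=0.5, epochs=1000):
    weights = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    for _ in range(epochs):
        for inputs, target in examples:
            y_hat = predict(weights, inputs)
            # w_j(k+1) = w_j(k) + rate * (y - y_hat) * x_j
            weights = [w + rate * (target - y_hat) * x for w, x in zip(weights, inputs)]
    return weights

# Inputs include a constant 1.0 so the unit also learns a bias weight.
examples = [((1.0, 0, 0), 0), ((1.0, 0, 1), 0), ((1.0, 1, 0), 0), ((1.0, 1, 1), 1)]
w = train(examples, n_inputs=3)
print([round(predict(w, x), 2) for x, _ in examples])   # outputs should approach 0, 0, 0, 1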
Model Evaluation and Selection
• Evaluation metrics: how can we measure accuracy? What other metrics should be considered?
• Use a validation/test set of class-labeled tuples instead of the training set when assessing accuracy
• Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Confusion Matrix
Confusion matrix:
Actual class \ Predicted class   yes                    no
yes                              True Positives (TP)    False Negatives (FN)
no                               False Positives (FP)   True Negatives (TN)
Example confusion matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000
• TP and TN are the correctly predicted tuples
• The matrix may have extra rows/columns to provide totals

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
With P positive and N negative tuples (P = TP + FN, N = FP + TN):
• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate: misclassification rate = 1 − accuracy
  Error rate = (FP + FN) / All
• Class imbalance problem:
– One class may be rare, e.g., fraud or HIV-positive
– Significant majority of the negative class and minority of the positive class
– Sensitivity: true positive recognition rate, Sensitivity = TP / P
– Specificity: true negative recognition rate, Specificity = TN / N

Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
• A perfect score is 1.0
• There is an inverse relationship between precision and recall
• F measure (F1 or F-score): harmonic mean of precision and recall
  F1 = 2 × precision × recall / (precision + recall)
• Fβ: weighted measure of precision and recall
– assigns β times as much weight to recall as to precision

Classifier Evaluation Metrics: Example
Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)
Precision = ??   Recall = ??
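The following illustrative Python sketch (not from the slides) computes the metrics defined above for the cancer confusion matrix, including the precision and recall left as “??”.

TP, FN = 90, 210          # actual cancer = yes
FP, TN = 140, 9560        # actual cancer = no
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)        # 0.965
error_rate  = (FP + FN) / (P + N)        # 0.035
sensitivity = TP / P                     # 0.30
specificity = TN / N                     # ≈ 0.9856
precision   = TP / (TP + FP)             # 90 / 230 ≈ 0.391
recall      = TP / (TP + FN)             # 90 / 300 = 0.30
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.34
print(round(precision, 3), round(recall, 3), round(f1, 3))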
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
– The given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random subsampling: a variation of holdout
• Repeat holdout k times; accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
– At the i-th iteration, use Di as the test set and the others as the training set
– Leave-one-out: k folds where k = number of tuples, for small-sized data

Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
– Examples used for the training set can be used for the test set too

Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
• Popular ensemble methods
– Bagging, boosting, and ensembles of heterogeneous classifiers

Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
– Oversampling: re-sampling of data from the positive class
– Undersampling: randomly eliminate tuples from the negative class

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• The plot also shows a diagonal line: along it, for every TP we are equally likely to encounter an FP
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
• A model with perfect accuracy will have an area of 1.0
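A short illustrative sketch of the ROC idea follows (the scikit-learn library and the toy score values are assumptions, not from the slides): from class-membership scores, compute the points of the ROC curve and the area under it.

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]                        # actual classes (1 = positive)
y_scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.2, 0.1]    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_scores))        # 1.0 = perfect model, 0.5 = diagonal line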
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class labels
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Comparison of Techniques
• Comparison of Approaches
– Model interpretability: ease of understanding classification decisions
– Model maintainability: ease of modifying the model in the presence of new training examples
– Training cost: computational cost for building a model
– Classification cost: computational cost for classifying an unseen record
Criterion               Decision Tree  Nearest Neighbours  Rule-based  Artificial Neural Network  Bayesian Classifier
Model interpretability  High           Medium              High        Low                        Medium
Model maintainability   Low            High                Low         Low                        Medium
Training cost           High           Low                 High        High                       Medium
Classification cost     Low            High                Medium      Low                        Low

Decision Tree Induction in Weka
• Overview
– ID3 (only works for categorical attributes)
– J48 (Java implementation of C4.5)
– RandomTree (with K attributes)
– RandomForest (a forest of random trees)
– REPTree (regression tree with reduced-error pruning)
– BFTree (best-first tree, using Gain or Gini)
– FT (functional tree, logistic regression as split nodes)
– SimpleCart (CART with cost-complexity pruning)

Decision Tree Induction in Weka
• Preparation
– Pre-processing attributes if necessary
– Specifying the class attribute
– Selecting attributes

Decision Tree Induction in Weka
• Constructing Classification Models (ID3)
1. Choose a method and set its parameters
2. Set a test option
3. Start the process
4. View the model and evaluation results
5. Select the option to view the tree
Other tree learners, such as J48 (unpruned tree) and RandomTree, are run in the same way.

Decision Tree Induction in Weka
• Classifying Unseen Records
1. Prepare the unseen records in an ARFF file; class values are left as unknown (“?”)
2. Classify the unseen records in the file: select the supplied-test-set option, click the Set… button to load the file, then press the start button to run the classification
3. Save the classification results into a file: select the option to pop up the visualisation, set both X and Y to instance_number, and save the results into a file
4. The classification results are written to an ARFF file with the class labels assigned
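As an illustration of step 1 above, a file of unseen records for Weka might look like the following hypothetical ARFF snippet (attribute names taken from the weather data used earlier; the relation name and the specific records are assumptions). The class attribute is declared as usual, but every class value is left as “?”.

% Hypothetical ARFF file of unseen records: class values are unknown ("?").
@relation weather-unseen

@attribute outlook {sunny, overcast, rain}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute class {P, N}

@data
overcast, hot, high, FALSE, ?
rain, mild, normal, TRUE, ?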
Comparison of Techniques
• Comparison of Performance in Weka
– A system module known as the Experimenter
– Designed for comparing the performance of classification techniques over a single data set or a collection of data sets
– Data miners set up an experiment with:
• Selected data set(s)
• Selected algorithm(s) and the number of times each is repeated
• Selected test option (e.g., cross-validation)
• Selected p value (indicating confidence)
– Output: accuracy rates of the algorithms
– Pairwise comparison of algorithms, with significantly better and worse accuracies marked

Comparison of Techniques
• Setting up an Experiment in Weka
– Create a new experiment (or open an existing one) and name the file to store the experiment results
– Choose a test option and the number of times each algorithm is repeated
– Add data sets and algorithms; the selected data sets and algorithms appear in lists

Comparison of Techniques
• Experiment Results in Weka
– Load the experiment data, choose the analysis method and the value of significance, and perform the analysis
– The output shows the results of the pairwise comparisons

Classification in Practice
• Process of a Classification Project
1. Locate data
2. Prepare data
3. Choose a classification method
4. Construct the model and tune the model
5. Measure its accuracy and go back to step 3 or 4 until the accuracy is satisfactory
6. Further evaluate the model from other aspects such as complexity, comprehensibility, etc.
7. Deliver the model and test it in a real environment. Further modify the model if necessary

Classification in Practice
• Data Preparation
– Identify descriptive features (input attributes)
– Identify or define the class
– Determine the sizes of the training, validation and test sets
– Select examples
• Spread and coverage of classes
• Spread and coverage of attribute values
• Null values
• Noisy data
– Prepare the input values (categorical to continuous, continuous to categorical)

References (1)
• C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997
• C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998
• P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative Frequent Pattern Analysis for Effective Classification. ICDE'07
• H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct Discriminative Pattern Mining for Effective Classification. ICDE'08
• W. Cohen. Fast effective rule induction. ICML'95
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05

References (3)
• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000
• J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997
• S. K.
Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998
• J. R. Quinlan. Induction of decision trees. Machine Learning, 1: 81-106, 1986
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993
• J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96

References (4)
• R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96
• J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005
• S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, 2005
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03