DM7: Classification: C4.5

Machine Learning in Real World: C4.5 Outline  Handling Numeric Attributes  Finding Best Split(s)  Dealing with Missing Values  Pruning  Pre-pruning, Post-pruning, Error Estimates  From Trees to Rules 2 Industrial-strength algorithms   For an algorithm to be useful in a wide range of realworld applications it must:  Permit numeric attributes  Allow missing values  Be robust in the presence of noise  Be able to approximate arbitrary concept descriptions (at least in principle) Basic schemes need to be extended to fulfill these requirements witten & eibe 3 C4.5 History  ID3, CHAID – 1960s  C4.5 innovations (Quinlan):  permit numeric attributes  deal sensibly with missing values  pruning to deal with for noisy data  C4.5 - one of best-known and most widely-used learning algorithms  Last research version: C4.8, implemented in Weka as J4.8 (Java)  Commercial successor: C5.0 (available from Rulequest) 4 Numeric attributes  Standard method: binary splits  E.g. temp < 45  Unlike nominal attributes, every attribute has many possible split points  Solution is straightforward extension:   Evaluate info gain (or other measure) for every possible split point of attribute  Choose “best” split point  Info gain for best split point is info gain for attribute Computationally more demanding witten & eibe 5 Weather data – nominal values Outlook Temperature Humidity Windy Play Sunny Hot High False No Sunny Hot High True No Overcast Hot High False Yes Rainy Mild Normal False Yes … … … … … If outlook = sunny and humidity = high then play = no If outlook = rainy and windy = true then play = no If outlook = overcast then play = yes If humidity = normal then play = yes If none of the above then play = yes witten & eibe 6 Weather data - numeric Outlook Temperature Humidity Windy Play Sunny 85 85 False No Sunny 80 90 True No Overcast 83 86 False Yes Rainy 75 80 False Yes … … … … … If outlook = sunny and humidity > 83 then play = no If outlook = rainy and windy = true then play = no If outlook = overcast then play = yes If humidity < 85 then play = yes If none of the above then play = yes 7 Example  Split on temperature attribute: 64 65 68 69 Yes No Yes Yes 70 71 72 Yes No No 72 75 Yes Yes 75 80 81 83 Yes No Yes Yes No  E.g. temperature  71.5: yes/4, no/2 temperature  71.5: yes/5, no/3  Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits  Place split points halfway between values  Can evaluate all split points in one pass! witten & eibe 8 85 Avoid repeated sorting!  Sort instances by the values of the numeric attribute  Time complexity for sorting: O (n log n)  Q. Does this have to be repeated at each node of the tree?  A: No! Sort order for children can be derived from sort order for parent  Time complexity of derivation: O (n)  Drawback: need to create and store an array of sorted indices for each numeric attribute witten & eibe 9 More speeding up  Entropy only needs to be evaluated between points of different classes (Fayyad & Irani, 1992) value 64 class Yes 65 68 69 No Yes Yes 70 71 72 Yes No No 72 75 Yes Yes 75 80 81 83 85 Yes No Yes Yes No X Potential optimal breakpoints Breakpoints between values of the same class cannot be optimal 10 Binary vs. multi-way splits  Splitting (multi-way) on a nominal attribute exhausts all information in that attribute   Nominal attribute is tested (at most) once on any path in the tree Not so for binary splits on numeric attributes!  Numeric attribute may be tested several times along a path in the tree  Disadvantage: tree is hard to read  Remedy: witten & eibe  pre-discretize numeric attributes, or  use multi-way splits instead of binary ones 11 Missing as a separate value  Missing value denoted “?” in C4.X  Simple idea: treat missing as a separate value  Q: When this is not appropriate?  A: When values are missing due to different reasons  Example 1: gene expression could be missing when it is very high or very low  Example 2: field IsPregnant=missing for a male patient should be treated differently (no) than for a female patient of age 25 (unknown) 12 Missing values - advanced Split instances with missing values into pieces   A piece going down a branch receives a weight proportional to the popularity of the branch  weights sum to 1 Info gain works with fractional instances   use sums of weights instead of counts During classification, split the instance into pieces in the same way  witten & eibe Merge probability distribution using weights 13 Pruning  Goal: Prevent overfitting to noise in the data  Two strategies for “pruning” the decision tree:  Postpruning - take a fully-grown decision tree and discard unreliable parts  Prepruning - stop growing a branch when information becomes unreliable  Postpruning preferred in practice— prepruning can “stop too early” 14 Prepruning  Based on statistical significance test  Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node  Most popular test: chi-squared test  ID3 used chi-squared test in addition to information gain  witten & eibe Only statistically significant attributes were allowed to be selected by information gain procedure 15 Early stopping a b class 1 0 0 0 2 0 1 1 3 1 0 1 4 1 1 0  Pre-pruning may stop the growth process prematurely: early stopping  Classic example: XOR/Parity-problem  No individual attribute exhibits any significant association to the class  Structure is only visible in fully expanded tree  Pre-pruning won’t expand the root node  But: XOR-type problems rare in practice  And: pre-pruning faster than post-pruning witten & eibe 16 Post-pruning  First, build full tree  Then, prune it  Fully-grown tree shows all attribute interactions  Problem: some subtrees might be due to chance effects  Two pruning operations: 1. Subtree replacement 2. Subtree raising Possible strategies:   error estimation  significance testing  MDL principle witten & eibe 17 Subtree replacement, 1  Bottom-up  Consider replacing a tree only after considering all its subtrees  Ex: labor negotiations witten & eibe 18 Subtree replacement, 2 What subtree can we replace? 19 Subtree replacement, 3  Bottom-up  Consider replacing a tree only after considering all its subtrees witten & eibe 20 Estimating error rates  Prune only if it reduces the estimated error  Error on the training data is NOT a useful estimator Q: Why it would result in very little pruning?  Use hold-out set for pruning (“reduced-error pruning”)  C4.5’s method witten & eibe  Derive confidence interval from training data  Use a heuristic limit, derived from this, for pruning  Standard Bernoulli-process-based method  Shaky statistical assumptions (based on training data) 22 *Mean and variance  Mean and variance for a Bernoulli trial: p, p (1–p)  Expected success rate f=S/N  Mean and variance for f : p, p (1–p)/N  For large enough N, f follows a Normal distribution  c% confidence interval [–z  X  z] for random variable with 0 mean is given by:  Pr[  z  X  z ]  c With a symmetric distribution: Pr[  z  X  z ]  1  2  Pr[ X  z ] witten & eibe 23 *Confidence limits  Confidence limits for the normal distribution with 0 mean and a variance of 1: –1  0 1 1.65 Pr[X  z] z 0.1% 3.09 0.5% 2.58 1% 2.33 5% 1.65 10% 1.28 20% 0.84 25% 0.69 40% 0.25 Thus: Pr[  1 . 65  X  1 . 65 ]  90 %  To use this we have to reduce our random variable f to have 0 mean and unit variance witten & eibe 24 *Transforming f  f  p Transformed value for f : p (1  p ) / N (i.e. subtract the mean and divide by the standard deviation)   Resulting equation:  Pr   z   Solving for p: 2  z p f  z  2N  witten & eibe 25   z  c p (1  p ) / N  f  p     2  N N 4N  f f 2 z 2  z  1   N   2 C4.5’s method  Error estimate for subtree is weighted sum of error estimates for all its leaves  Error estimate for a node (upper bound): 2  z e f  z  2N      2  N N 4N  f f 2 z 2 2  z   1   N    If c = 25% then z = 0.69 (from normal distribution)  f is the error on the training data  N is the number of instances covered by the leaf witten & eibe 26 Example f = 5/14 e = 0.46 e < 0.51 so prune! f=0.33 e=0.47 witten & eibe f=0.5 e=0.72 f=0.33 e=0.47 Combined27using ratios 6:2:6 gives 0.51 From trees to rules – how? How can we produce a set of rules from a decision tree? 29 From trees to rules – simple  Simple way: one rule for each leaf  C4.5rules: greedily prune conditions from each rule if this reduces its estimated error  Can produce duplicate rules  Check for this at the end  Then  look at each class in turn  consider the rules for that class  find a “good” subset (guided by MDL)  Then rank the subsets to avoid conflicts  Finally, remove rules (greedily) if this decreases error on the training data witten & eibe 30 C4.5rules: choices and options  C4.5rules slow for large and noisy datasets  Commercial version C5.0rules uses a different technique   Much faster and a bit more accurate C4.5 has two parameters  Confidence value (default 25%): lower values incur heavier pruning  Minimum number of instances in the two most popular branches (default 2) witten & eibe 31 Summary  Decision Trees  splits – binary, multi-way  split criteria – entropy, gini, …  missing value treatment  pruning  rule extraction from trees  Both C4.5 and CART are robust tools  No method is always superior – experiment! witten & eibe 36

DM7: Classification: C4.5

Related documents

Products

Support

DM7: Classification: C4.5

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib