Demo: Classification Programs
C4.5
CBA
Minqing Hu
CS594 Fall 2003 UIC

C4.5
• Classification using decision trees.
• Where to find the program?
  – C4.5 Release 8, by Ross Quinlan
  – http://www.cse.unsw.edu.au/~quinlan/
• Runs under Unix
• Reference book: “C4.5: Programs for Machine Learning”, J. Ross Quinlan

C4.5 Files
• Names file (filestem.names)
  – Provides names for classes, attributes, and attribute values.
  – Consists of a series of entries, each starting on a new line and ending with a period.
    • The first entry gives the class names, separated by commas.
    • The rest of the file consists of a single entry for each attribute.
      – Each entry begins with the attribute name followed by a colon, then a specification of the values that the attribute can take.
      – Four specifications are possible:
        » ignore: causes the value of the attribute to be disregarded
        » continuous: the attribute has numeric values
        » discrete N: N is a positive integer; specifies that the attribute has no more than N discrete values
        » a list of names separated by commas: the attribute takes one of the listed values

Example: Golf.names
    Play, Don't Play.    | class labels
    outlook: sunny, overcast, rain.
    temperature: continuous.
    humidity: continuous.
    windy: true, false.

C4.5 Files (cont)
• Data file (filestem.data)
  – Describes the training cases used to generate the decision tree and/or rules
  – Each line describes one case, giving values for all the attributes and then the case’s class, separated by commas and terminated by a period
  – Attribute values must appear in the same order in which the attributes were given in the names file
  – Use ? to mark a missing or unknown value
• Test file (filestem.test)
  – Used to evaluate the classifier you have produced
  – In exactly the same format as the data file

Example: Golf.data
    | outlook, temperature, humidity, windy, class label
    sunny, 85, 85, false, Don't Play
    sunny, 80, 90, true, Don't Play
    overcast, 83, 78, false, Play
    rain, 70, 96, ?, Play
    rain, 68, ?, false, Play
    rain, 65, 70, true, Don't Play
    overcast, 64, 65, true, Play
    sunny, 72, 95, false, Don't Play
    sunny, 69, 70, false, Play
    overcast, 72, 90, true, Play
    overcast, 81, 75, false, Play
    rain, 71, 80, true, Don't Play

Running the programs
• C4.5: decision tree generation
    c4.5 -f filestem [-u]
  – -f filestem (Default: DF): specifies the filestem of the task
  – -u (Default: no test set): use this option when a test file has been prepared
  Example:
    only training:         c4.5 -f ../Data/vote
    training and testing:  c4.5 -f ../Data/vote -u

c4.5 output
    C4.5 [release 8] decision tree generator    Fri Sep 12 12:02:31 2003
    ----------------------------------------

        Options:
            File stem <../Data/vote>

    Read 300 cases (16 attributes) from ../Data/vote.data

    Decision Tree:

    physician fee freeze = n:
    |   adoption of the budget resolution = y: democrat (151.0)
    |   adoption of the budget resolution = u: democrat (1.0)
    |   adoption of the budget resolution = n:
    |   |   education spending = n: democrat (6.0)
    |   |   education spending = y: democrat (9.0)
    |   |   education spending = u: republican (1.0)
    physician fee freeze = y:
    |   synfuels corporation cutback = n: republican (97.0/3.0)
    |   synfuels corporation cutback = u: republican (4.0)
    |   synfuels corporation cutback = y:
    |   |   duty free exports = y: democrat (2.0)
    |   |   duty free exports = u: republican (1.0)
    |   |   duty free exports = n:
    |   |   |   education spending = n: democrat (5.0/2.0)
    |   |   |   education spending = y: republican (13.0/2.0)
    |   |   |   education spending = u: democrat (1.0)
    physician fee freeze = u:
    |   water project cost sharing = n: democrat (0.0)
    |   water project cost sharing = y: democrat (4.0)
    |   water project cost sharing = u:
    |   |   mx missile = n: republican (0.0)
    |   |   mx missile = y: democrat (3.0/1.0)
    |   |   mx missile = u: republican (2.0)

The numbers at the leaves have the form (N) or (N/E):
• N is the sum of the (possibly fractional) cases that reach the leaf
• E is the number of those cases that belong to classes other than the nominated class

c4.5 output (cont)
    Simplified Decision Tree:

    physician fee freeze = n: democrat (168.0/2.6)
    physician fee freeze = y: republican (123.0/13.9)
    physician fee freeze = u:
    |   mx missile = n: democrat (3.0/1.1)
    |   mx missile = y: democrat (4.0/2.2)
    |   mx missile = u: republican (2.0/1.0)

c4.5 output (cont)
    Evaluation on training data (300 items):

         Before Pruning           After Pruning
        ----------------   ---------------------------
        Size      Errors   Size      Errors   Estimate

          25    8( 2.7%)      7   13( 4.3%)    ( 6.9%)   <<

    Evaluation on test data (135 items):

         Before Pruning           After Pruning
        ----------------   ---------------------------
        Size      Errors   Size      Errors   Estimate

          25    7( 5.2%)      7    4( 3.0%)    ( 6.9%)   <<

         (a)  (b)    <-classified as
        ---- ----
          80    3    (a): class democrat
           1   51    (b): class republican

Running the programs (cont)
• C4.5rules: rule induction
  Should only be run after the decision tree program c4.5, since it reads the file containing the unpruned tree.
    c4.5rules -f filestem [-u]
  Example:
    c4.5rules -f ../Data/vote

c4.5rules output
    C4.5 [release 8] rule generator    Fri Sep 12 12:07:10 2003
    -------------------------------

        Options:
            File stem <../Data/vote>

    Read 300 cases (16 attributes) from ../Data/vote

    ------------------
    Processing tree 0

    Final rules from tree 0:

    Rule 2:
        physician fee freeze = n
        ->  class democrat  [98.4%]

    Rule 9:
        synfuels corporation cutback = y
        duty free exports = y
        ->  class democrat  [97.5%]

    …

    Rule 13:
        physician fee freeze = u
        mx missile = u
        ->  class republican  [50.0%]

    Default class: democrat

c4.5rules output (cont)
    Evaluation on training data (300 items):

    Rule  Size  Error  Used  Wrong          Advantage
    ----  ----  -----  ----  -----          ---------
       2     1   1.6%   168  1 (0.6%)      -1 (0|1)    democrat
       9     2   2.5%     3  0 (0.0%)       0 (0|0)    democrat
      11     2  29.3%     3  0 (0.0%)       0 (0|0)    democrat
       5     2   5.2%    97  3 (3.1%)      21 (23|2)   republican
       7     3   6.0%    15  2 (13.3%)     11 (13|2)   republican
       3     2  18.0%     2  0 (0.0%)       2 (2|0)    republican
      13     2  50.0%     2  0 (0.0%)       2 (2|0)    republican

    Drop rule 2

    Rule  Size  Error  Used  Wrong          Advantage
    ----  ----  -----  ----  -----          ---------
       9     2   2.5%    54  0 (0.0%)       0 (0|0)    democrat
      11     2  29.3%     3  0 (0.0%)       0 (0|0)    democrat
       5     2   5.2%    97  3 (3.1%)      21 (23|2)   republican
       7     3   6.0%    15  2 (13.3%)     11 (13|2)   republican
       3     2  18.0%     3  0 (0.0%)       3 (3|0)    republican
      13     2  50.0%     2  0 (0.0%)       2 (2|0)    republican

    Tested 300, errors 9 (3.0%)   <<

         (a)  (b)    <-classified as
        ---- ----
         179    5    (a): class democrat
           4  112    (b): class republican

    Evaluation on test data (135 items):

    Rule  Size  Error  Used  Wrong          Advantage
    ----  ----  -----  ----  -----          ---------
       9     2   2.5%    24  2 (8.3%)       0 (0|0)    democrat
      11     2  29.3%     1  0 (0.0%)       0 (0|0)    democrat
       5     2   5.2%    41  0 (0.0%)       6 (6|0)    republican
       7     3   6.0%     8  3 (37.5%)      2 (5|3)    republican
       3     2  18.0%     2  0 (0.0%)       2 (2|0)    republican

    Tested 135, errors 7 (5.2%)   <<

         (a)  (b)    <-classified as
        ---- ----
          80    3    (a): class democrat
           4   48    (b): class republican

confusion matrix & error rate
                         Actual class
                           A      B
    Predicted class   A   80      4
                      B    3     48

• Error rate of this classifier: (4+3)/(83+52) = 5.2%

CBA
• Classification Based on Association
  – Download at http://www.comp.nus.edu.sg/~dm2
  – Uses the same file formats as c4.5, i.e., *.names, *.data, and *.test
  – Refer to the help topics
  – Discretization function: the discretization program is sometimes incompatible with some systems; if errors occur, try the DOS version of the discretizer, “discretize”, under the CBA directory.

Data Repository online
• UCI machine learning repository
  http://www.ics.uci.edu/~mlearn/MLRepository.html
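
Worked example: error rate from a confusion matrix
The error-rate arithmetic on the confusion matrix slide can be checked with a few lines of Python. This is only a sketch and not part of C4.5 or CBA: the function name error_rate and the variable names are made up for illustration, and the matrices are copied by hand from the c4.5rules output shown earlier (rows are actual classes, columns are the classes assigned by the rules).

```python
def error_rate(matrix):
    """Return the fraction of misclassified cases in a square confusion matrix.

    matrix[i][j] counts the cases of actual class i classified as class j,
    so the correctly classified cases sit on the diagonal. Transposing the
    matrix (predicted class as rows) gives the same result.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# Hand-copied from the c4.5rules output: (a) democrat, (b) republican.
vote_train = [[179, 5],
              [4, 112]]
vote_test = [[80, 3],
             [4, 48]]

print(f"training error rate: {error_rate(vote_train):.1%}")
print(f"test error rate:     {error_rate(vote_test):.1%}")
```

For the test matrix this reproduces the slide's calculation, (4+3)/(83+52), and for the training matrix it reproduces the 9 errors out of 300 reported by c4.5rules.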