C4.5-CBA slides

Demo:
Classification Programs
C4.5
CBA
Minqing Hu
CS594
Fall 2003 UIC
C4.5
• Classification using decision tree.
• Where to find the program?
– C4.5 Release 8: by Ross Quinlan
– http://www.cse.unsw.edu.au/~quinlan/
• Running under Unix
• Reference book: “C4.5: Programs for Machine Learning”, J. Ross Quinlan
C4.5 Files
• Names files (filestem.names)
– provides names for classes, attributes, and attribute
values.
– Consists of a series of entries, each starting on a new
line and ending with a period.
• The first entry gives the class names, separated by commas.
• The rest of the file consists of a single entry for each attribute.
– Begins with the attribute name followed by a colon, then a
specification of the values that the attribute can take.
– Four specifications are possible:
» ignore; causes the value of the attribute to be disregarded
» continuous; attribute has numeric values
» discrete N; N is a positive integer, specifies that the
attribute has no more than N discrete values
» a list of names separated by commas; the attribute takes one of the
listed discrete values
Example: Golf.names
Play, Don't Play. | class labels
outlook: sunny, overcast, rain.
temperature: continuous.
humidity: continuous.
windy: true, false.
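A minimal Python sketch of how a names file in this format could be read. It is not part of C4.5; the function name parse_names is hypothetical, and the sketch assumes that "|" starts a comment, that each entry sits on one line, and that names contain no escaped characters.

```python
def parse_names(text):
    """Parse C4.5-style .names text into (classes, attributes).

    Assumptions of this sketch: '|' starts a comment, every entry sits on
    one line and ends with a period, and names contain no escaped characters.
    """
    entries = []
    for line in text.splitlines():
        line = line.split("|", 1)[0].strip()   # drop a trailing comment
        if line:
            entries.append(line.rstrip(".").strip())

    # First entry: comma-separated class names.
    classes = [c.strip() for c in entries[0].split(",")]

    # Remaining entries: "name: specification".
    attributes = {}
    for entry in entries[1:]:
        name, spec = (part.strip() for part in entry.split(":", 1))
        if spec in ("ignore", "continuous"):
            attributes[name] = spec
        elif spec.startswith("discrete "):
            attributes[name] = ("discrete", int(spec.split()[1]))
        else:
            attributes[name] = [v.strip() for v in spec.split(",")]
    return classes, attributes


golf_names = """\
Play, Don't Play. | class labels
outlook: sunny, overcast, rain.
temperature: continuous.
humidity: continuous.
windy: true, false.
"""

classes, attributes = parse_names(golf_names)
print(classes)
print(attributes["outlook"])
```

Each of the four specification kinds maps to its own Python value (a string, a tuple, or a list), so later code can tell them apart.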
C4.5 Files (cont)
• Data file (filestem.data)
– The data file describes the training cases for generating
the decision tree and/or rules
– Each line describes one case, providing values for all
the attributes and then the case’s class, separated by
commas and terminated by a period
– Attribute values must appear in the same order that
the attributes were given in the names file
– Use ? in place of a missing or unknown value
• Test file (filestem.test)
– Used to evaluate the classifier you have produced
– In exactly the same format as the data file
Example: Golf.data
| outlook, temperature, humidity, windy, class label
sunny, 85, 85, false, Don't Play
sunny, 80, 90, true, Don't Play
overcast, 83, 78, false, Play
rain, 70, 96, ?, Play
rain, 68, ?, false, Play
rain, 65, 70, true, Don't Play
overcast, 64, 65, true, Play
sunny, 72, 95, false, Don't Play
sunny, 69, 70, false, Play
overcast, 72, 90, true, Play
overcast, 81, 75, false, Play
rain, 71, 80, true, Don't Play
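A short Python sketch of reading cases in this format, including the ? marker for missing values. The helper parse_data is hypothetical and not part of C4.5; it assumes one case per line, treats "|" as a comment marker, and tolerates a missing trailing period, as in the example above.

```python
def parse_data(text, n_attributes):
    """Parse C4.5-style .data text into a list of (values, label) pairs.

    Assumptions of this sketch: one case per line, '|' starts a comment,
    a trailing period is optional, and '?' marks a missing value (stored
    as None). Values stay as strings; no type conversion is attempted.
    """
    cases = []
    for line in text.splitlines():
        line = line.split("|", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.rstrip(".").split(",")]
        if len(fields) != n_attributes + 1:
            raise ValueError(f"expected {n_attributes + 1} fields: {line!r}")
        values = [None if f == "?" else f for f in fields[:-1]]
        cases.append((values, fields[-1]))   # the class label comes last
    return cases


golf_data = """\
| outlook, temperature, humidity, windy, class label
sunny, 85, 85, false, Don't Play
overcast, 83, 78, false, Play
rain, 70, 96, ?, Play
rain, 68, ?, false, Play
"""

cases = parse_data(golf_data, n_attributes=4)
missing = sum(v is None for values, _ in cases for v in values)
print(len(cases), "cases,", missing, "missing values")
```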
Running the programs
• C4.5: decision tree generation
“c4.5 -f filestem [-u]”
-f filestem (Default: DF)
Specifies the filestem of the task
-u (Default: no test set)
Use this option when a test file (filestem.test) has been
prepared; the tree is then also evaluated on the test cases
Example:
training only: “c4.5 -f ../Data/vote”
training and testing: “c4.5 -f ../Data/vote -u”
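The two invocations above can be assembled from a script. The Python sketch below only builds the argument list from the -f and -u options described on this slide; the helper name c45_command and its parameters are hypothetical, and actually running the command assumes a c4.5 binary is installed and on the PATH.

```python
import shlex


def c45_command(filestem, use_test_file=False):
    """Build the argument list for a c4.5 run.

    Only the -f and -u options described above are used. The helper name
    and its parameters are hypothetical conveniences of this sketch.
    """
    args = ["c4.5", "-f", filestem]
    if use_test_file:
        args.append("-u")   # also evaluate on filestem.test
    return args


train_only = c45_command("../Data/vote")
train_and_test = c45_command("../Data/vote", use_test_file=True)
print(shlex.join(train_and_test))

# To actually run it (requires the c4.5 binary on PATH):
#   import subprocess
#   subprocess.run(train_and_test, check=True)
```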
c4.5 output
C4.5 [release 8] decision tree generator Fri Sep 12 12:02:31 2003
---------------------------------------
Options:
File stem <../Data/vote>
Read 300 cases (16 attributes) from ../Data/vote.data
Decision Tree:
physician fee freeze = n:
| adoption of the budget resolution = y: democrat (151.0)
| adoption of the budget resolution = u: democrat (1.0)
| adoption of the budget resolution = n:
| | education spending = n: democrat (6.0)
| | education spending = y: democrat (9.0)
| | education spending = u: republican (1.0)
physician fee freeze = y:
| synfuels corporation cutback = n: republican (97.0/3.0)
| synfuels corporation cutback = u: republican (4.0)
| synfuels corporation cutback = y:
| | duty free exports = y: democrat (2.0)
| | duty free exports = u: republican (1.0)
| | duty free exports = n:
| | | education spending = n: democrat (5.0/2.0)
| | | education spending = y: republican (13.0/2.0)
| | | education spending = u: democrat (1.0)
physician fee freeze = u:
| water project cost sharing = n: democrat (0.0)
| water project cost sharing = y: democrat (4.0)
| water project cost sharing = u:
| | mx missile = n: republican (0.0)
| | mx missile = y: democrat (3.0/1.0)
| | mx missile = u: republican (2.0)
The numbers at the leaves take the form (N) or (N/E):
• N is the sum of the cases that reach the leaf
• E is the number of those cases that belong to classes other than the
leaf's nominated class
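As a check on the (N) and (N/E) notation, the leaf counts of the tree above can be summed. The Python sketch below (the regular expression and the helper name leaf_counts are illustrative choices, not part of C4.5) extracts N and E from every leaf line; the totals should equal the number of training cases read and the number of training errors made by the unpruned tree.

```python
import re

# The unpruned tree exactly as printed above.
tree_text = """\
physician fee freeze = n:
| adoption of the budget resolution = y: democrat (151.0)
| adoption of the budget resolution = u: democrat (1.0)
| adoption of the budget resolution = n:
| | education spending = n: democrat (6.0)
| | education spending = y: democrat (9.0)
| | education spending = u: republican (1.0)
physician fee freeze = y:
| synfuels corporation cutback = n: republican (97.0/3.0)
| synfuels corporation cutback = u: republican (4.0)
| synfuels corporation cutback = y:
| | duty free exports = y: democrat (2.0)
| | duty free exports = u: republican (1.0)
| | duty free exports = n:
| | | education spending = n: democrat (5.0/2.0)
| | | education spending = y: republican (13.0/2.0)
| | | education spending = u: democrat (1.0)
physician fee freeze = u:
| water project cost sharing = n: democrat (0.0)
| water project cost sharing = y: democrat (4.0)
| water project cost sharing = u:
| | mx missile = n: republican (0.0)
| | mx missile = y: democrat (3.0/1.0)
| | mx missile = u: republican (2.0)
"""

# A leaf line ends with "class (N)" or "class (N/E)".
LEAF = re.compile(r": (\w+) \((\d+\.\d+)(?:/(\d+\.\d+))?\)$")


def leaf_counts(text):
    """Return a list of (class, N, E) tuples, one per leaf; E is 0 if absent."""
    leaves = []
    for line in text.splitlines():
        m = LEAF.search(line.strip())
        if m:
            leaves.append((m.group(1), float(m.group(2)), float(m.group(3) or 0.0)))
    return leaves


leaves = leaf_counts(tree_text)
total_cases = sum(n for _, n, _ in leaves)
total_errors = sum(e for _, _, e in leaves)
print(len(leaves), total_cases, total_errors)   # 17 300.0 8.0
```

The 300 cases match the "Read 300 cases" line in the output above.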
c4.5 output (cont)
Simplified Decision Tree:
physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
| mx missile = n: democrat (3.0/1.1)
| mx missile = y: democrat (4.0/2.2)
| mx missile = u: republican (2.0/1.0)
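The simplified tree is small enough to write out by hand. The Python sketch below encodes it as nested tests; the function name classify and the dictionary representation of a case are assumptions made for illustration, with attribute values 'y', 'n', and 'u' as in the tree.

```python
def classify(case):
    """Classify a vote record with the simplified tree printed above.

    `case` is a dict mapping attribute names to 'y', 'n' or 'u'
    (a representation chosen for this sketch).
    """
    fee = case["physician fee freeze"]
    if fee == "n":
        return "democrat"
    if fee == "y":
        return "republican"
    # fee == "u": the tree tests mx missile next
    return "republican" if case["mx missile"] == "u" else "democrat"


print(classify({"physician fee freeze": "y", "mx missile": "n"}))   # republican
print(classify({"physician fee freeze": "u", "mx missile": "u"}))   # republican
print(classify({"physician fee freeze": "u", "mx missile": "y"}))   # democrat
```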
c4.5 output (cont)
Evaluation on training data (300 items):

   Before Pruning           After Pruning
  ----------------   ---------------------------
  Size      Errors   Size      Errors   Estimate
    25    8( 2.7%)      7   13( 4.3%)    ( 6.9%)   <<

Evaluation on test data (135 items):

   Before Pruning           After Pruning
  ----------------   ---------------------------
  Size      Errors   Size      Errors   Estimate
    25    7( 5.2%)      7    4( 3.0%)    ( 6.9%)   <<

   (a)  (b)    <-classified as
  ---- ----
    80    3    (a): class democrat
     1   51    (b): class republican
Running the programs (cont)
• C4.5rules: rule induction
Should only be used after running the decision
tree program c4.5, since it reads the
filestem.unpruned file containing the unpruned tree.
“c4.5rules -f filestem [-u]”
Example: c4.5rules -f ../Data/vote
c4.5rules output
C4.5 [release 8] rule generator Fri Sep 12 12:07:10 2003
------------------------------
Options:
File stem <../Data/vote>
Read 300 cases (16 attributes) from ../Data/vote
-----------------
Processing tree 0
Final rules from tree 0:
Rule 2:
physician fee freeze = n
-> class democrat [98.4%]
Rule 9:
synfuels corporation cutback = y
duty free exports = y
-> class democrat [97.5%]
…
Rule 13:
physician fee freeze = u
mx missile = u
-> class republican [50.0%]
Default class: democrat
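A rule set like this can be applied as an ordered list with a default class. The Python sketch below assumes that rules are tried in the order listed and that the first rule whose conditions all hold decides the class. It encodes only the three rules shown above (the elided rules are not reconstructed), so its answers can differ from those of the full rule set. All names in the code are hypothetical.

```python
# Only the rules printed above; the rules elided in the listing are NOT included.
RULES = [
    ("Rule 2", {"physician fee freeze": "n"}, "democrat"),
    ("Rule 9", {"synfuels corporation cutback": "y",
                "duty free exports": "y"}, "democrat"),
    ("Rule 13", {"physician fee freeze": "u", "mx missile": "u"}, "republican"),
]
DEFAULT_CLASS = "democrat"


def classify_with_rules(case, rules=RULES, default=DEFAULT_CLASS):
    """Return (class, rule name) for the first rule whose conditions all match.

    Assumes first-match semantics; falls back to the default class when
    no rule covers the case.
    """
    for name, conditions, label in rules:
        if all(case.get(attr) == value for attr, value in conditions.items()):
            return label, name
    return default, "default"


print(classify_with_rules({"physician fee freeze": "n"}))
print(classify_with_rules({"physician fee freeze": "u", "mx missile": "u"}))
print(classify_with_rules({"physician fee freeze": "u", "mx missile": "y"}))
```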
c4.5rules output (cont)
Evaluation on training data (300 items):

Rule  Size  Error  Used  Wrong       Advantage
----  ----  -----  ----  ----------  ---------
   2     1   1.6%   168  1 (0.6%)    -1 (0|1)    democrat
   9     2   2.5%     3  0 (0.0%)     0 (0|0)    democrat
  11     2  29.3%     3  0 (0.0%)     0 (0|0)    democrat
   5     2   5.2%    97  3 (3.1%)    21 (23|2)   republican
   7     3   6.0%    15  2 (13.3%)   11 (13|2)   republican
   3     2  18.0%     2  0 (0.0%)     2 (2|0)    republican
  13     2  50.0%     2  0 (0.0%)     2 (2|0)    republican
Drop rule 2
Rule  Size  Error  Used  Wrong       Advantage
----  ----  -----  ----  ----------  ---------
   9     2   2.5%    54  0 (0.0%)     0 (0|0)    democrat
  11     2  29.3%     3  0 (0.0%)     0 (0|0)    democrat
   5     2   5.2%    97  3 (3.1%)    21 (23|2)   republican
   7     3   6.0%    15  2 (13.3%)   11 (13|2)   republican
   3     2  18.0%     3  0 (0.0%)     3 (3|0)    republican
  13     2  50.0%     2  0 (0.0%)     2 (2|0)    republican

Tested 300, errors 9 (3.0%) <<

   (a)  (b)    <-classified as
  ---- ----
   179    5    (a): class democrat
     4  112    (b): class republican
Evaluation on test data (135 items):

Rule  Size  Error  Used  Wrong       Advantage
----  ----  -----  ----  ----------  ---------
   9     2   2.5%    24  2 (8.3%)     0 (0|0)    democrat
  11     2  29.3%     1  0 (0.0%)     0 (0|0)    democrat
   5     2   5.2%    41  0 (0.0%)     6 (6|0)    republican
   7     3   6.0%     8  3 (37.5%)    2 (5|3)    republican
   3     2  18.0%     2  0 (0.0%)     2 (2|0)    republican

Tested 135, errors 7 (5.2%) <<

   (a)  (b)    <-classified as
  ---- ----
    80    3    (a): class democrat
     4   48    (b): class republican
confusion matrix & error rate

                    Predicted class
                      A       B
  Actual class   A   80       3
                 B    4      48

error rate of this classifier
(4+3)/(83+52) = 7/135 = 5.2%
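The same arithmetic as a short Python sketch. It assumes, as in the c4.5rules test output, that the first index is the actual class and the second is the predicted class; the dictionary layout and the function name error_rate are illustrative choices.

```python
# Keys are (actual class, predicted class); counts from the test-data matrix.
confusion = {
    ("A", "A"): 80, ("A", "B"): 3,
    ("B", "A"): 4,  ("B", "B"): 48,
}


def error_rate(matrix):
    """Fraction of cases whose predicted class differs from the actual class."""
    total = sum(matrix.values())
    wrong = sum(n for (actual, predicted), n in matrix.items()
                if actual != predicted)
    return wrong / total


rate = error_rate(confusion)
print(f"{rate:.1%}")   # 5.2%
```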
CBA
• Classification Based on Association
– Download at http://www.comp.nus.edu.sg/~dm2
– Uses the same data file types as C4.5, i.e., *.names, *.data,
and *.test
– Refer to the help topics
– Discretization function: the discretization program is
sometimes incompatible with some systems; if
errors occur, try the DOS version of the
discretizer, “discretize”, under the CBA directory
Data Repository online
• UCI machine learning repository
http://www.ics.uci.edu/~mlearn/MLRepository.html