Chapter 4 – Classification, Part 1

Chapter 4
Classification: Basic Concepts, Decision Trees & Model Evaluation
Part 1: Classification with Decision Trees
Classification: Definition
Example of Classification Task
General Approach for Building Classification Model
Classification Techniques
Example of Decision Tree
Another Example of Decision Tree
Decision Tree Classification Task
Apply Model to Test Data
Decision Tree Classification Task
Decision Tree Induction
General Structure of Hunt’s Algorithm
Hunt’s Algorithm
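
Hunt’s algorithm grows the tree recursively: if all records at a node belong to the same class, the node becomes a leaf with that label; otherwise an attribute test condition is selected, the records are partitioned by its outcomes, and the procedure recurses on each partition. Below is a minimal runnable sketch; the function names, the sample data, and the use of weighted Gini to pick the test are illustrative choices, since Hunt’s general structure leaves the selection measure open.

```python
from collections import Counter

def hunts(records, attributes):
    """Grow a decision tree by Hunt's recursive procedure.
    records: list of (feature_dict, class_label); attributes: set of names."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1:          # pure node -> leaf
        return labels[0]
    if not attributes:                 # no tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def gini(labs):                    # impurity of one candidate child
        n = len(labs)
        return 1 - sum((c / n) ** 2 for c in Counter(labs).values())

    def split_gini(attr):              # weighted impurity of a multiway split
        groups = {}
        for feats, label in records:
            groups.setdefault(feats[attr], []).append(label)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    test = min(attributes, key=split_gini)        # choose a test condition
    return (test, {value: hunts([r for r in records if r[0][test] == value],
                                attributes - {test})
                   for value in {feats[test] for feats, _ in records}})

data = [({"refund": "yes", "marital": "single"},  "no"),
        ({"refund": "yes", "marital": "married"}, "no"),
        ({"refund": "no",  "marital": "single"},  "yes"),
        ({"refund": "no",  "marital": "married"}, "yes")]
print(hunts(data, {"refund", "marital"}))
# e.g. ('refund', {'yes': 'no', 'no': 'yes'})
```
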
Design Issues of Decision Tree Induction
Methods for Expressing Test Conditions
Test Condition for Nominal Attributes
Test Condition for Ordinal Attributes
Test Condition for Continuous Attributes
Splitting Based on Continuous Attributes
How to Determine the Best Split / 1
How to Determine the Best Split / 2
Measures of Node Impurity
Finding the Best Split / 1
Finding the Best Split / 2
Measure of Impurity: GINI
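
The Gini index of a node t measures how mixed its class distribution is:

    \mathrm{GINI}(t) = 1 - \sum_{j} \bigl[p(j \mid t)\bigr]^{2}

where p(j | t) is the relative frequency of class j at node t. It is 0 when all records belong to a single class and reaches its maximum, 1 - 1/c, when records are evenly distributed over c classes.
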
Computing GINI Index of a Single Node
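
A minimal sketch of this computation in Python (the function name and example counts are illustrative):

```python
def gini(counts):
    """Gini index of a node, from its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.0     -> pure node
print(gini([1, 5]))  # ≈ 0.278
print(gini([3, 3]))  # 0.5     -> maximum impurity for two classes
```
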
Computing GINI Index for a Collection of Nodes
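
A split into k children is scored by the weighted average of the children’s Gini values,

    \mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{GINI}(i)

where n_i is the number of records routed to child i. A sketch reusing gini() from above (the (5,2)/(1,4) counts are a common two-way example):

```python
def gini_split(partitions):
    """Weighted Gini of a split; partitions = per-child class-count lists."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(gini_split([[5, 2], [1, 4]]))  # ≈ 0.371
```
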
Binary Attributes: Computing GINI Index
Categorical Attributes: Computing GINI Index
Continuous Attributes: Computing GINI Index / 1
Continuous Attributes: Computing GINI Index / 2
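
For a continuous attribute, the efficient approach is to sort the records by attribute value once and then sweep over the candidate thresholds between adjacent distinct values, updating the class counts on both sides incrementally so each candidate is evaluated in constant time rather than with a full pass. A sketch reusing gini_split() (the function name and example data are illustrative):

```python
from collections import Counter

def best_gini_split(values, labels):
    """Best binary split 'value <= t' for one continuous attribute."""
    data = sorted(zip(values, labels))
    left, right = Counter(), Counter(label for _, label in data)
    best = (float("inf"), None)
    for i, (v, label) in enumerate(data[:-1]):
        left[label] += 1                # move one record to the left side
        right[label] -= 1
        if v == data[i + 1][0]:
            continue                    # no threshold between equal values
        t = (v + data[i + 1][0]) / 2
        g = gini_split([list(left.values()), list(right.values())])
        best = min(best, (g, t))
    return best                         # (weighted Gini, threshold)

print(best_gini_split([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                      ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]))
# (0.3, 97.5)
```
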
Measure of Impurity: Entropy
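
The entropy of a node t is

    \mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t)

with the convention 0 \log_2 0 = 0. It is 0 for a pure node and reaches its maximum, \log_2 c, when records are evenly distributed over c classes.
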
Computing Entropy of a Single Node
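
A sketch in the same style as gini() above:

```python
from math import log2

def entropy(counts):
    """Entropy of a node, from its per-class record counts."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0    -> pure node
print(entropy([1, 5]))  # ≈ 0.65
print(entropy([3, 3]))  # 1.0    -> maximum for two classes
```
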
Computing Information Gain After Splitting
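
Information gain is the drop in entropy from the parent to the weighted children:

    \mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)

A sketch reusing entropy(), with the same (5,2)/(1,4) split as before:

```python
def info_gain(parent_counts, partitions):
    """Entropy of the parent minus weighted entropy of the children."""
    n = sum(parent_counts)
    children = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent_counts) - children

print(info_gain([6, 6], [[5, 2], [1, 4]]))  # ≈ 0.196
```
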
Problems with Information Gain
Gain Ratio
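
Information gain is biased toward splits that produce many small, pure partitions (the problem raised on the previous slide). C4.5’s gain ratio corrects for this by dividing the gain by the split information,

    \mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}

which grows with the number of partitions. A sketch reusing info_gain() and log2 from above:

```python
def gain_ratio(parent_counts, partitions):
    """Gain ratio (C4.5): information gain normalised by split info."""
    n = sum(parent_counts)
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions)
    return info_gain(parent_counts, partitions) / split_info

print(gain_ratio([6, 6], [[5, 2], [1, 4]]))  # ≈ 0.200
```
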
Measure of Impurity: Classification Error
Computing Error of a Single Node
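
The classification error of a node is one minus its majority-class probability:

    \mathrm{Error}(t) = 1 - \max_j \, p(j \mid t)

A sketch:

```python
def class_error(counts):
    """Misclassification error of a node, from per-class record counts."""
    return 1.0 - max(counts) / sum(counts)

print(class_error([0, 6]))  # 0.0
print(class_error([1, 5]))  # ≈ 0.167
print(class_error([3, 3]))  # 0.5
```
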
Comparison among Impurity Measures
For binary (2-class) classification problems
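
A quick numeric sweep over the class-1 fraction p of a 2-class node reproduces the familiar comparison: all three measures are 0 for a pure node and peak at p = 0.5, with entropy topping out at 1 and Gini and error at 0.5:

```python
for p in (0.0, 0.1, 0.3, 0.5):
    counts = [10 * p, 10 * (1 - p)]
    print(f"p={p:.1f}  error={class_error(counts):.2f}  "
          f"gini={gini(counts):.2f}  entropy={entropy(counts):.2f}")
```
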
Misclassification Error vs Gini Index
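
A concrete instance of the contrast: a split can lower the Gini index while leaving the misclassification error unchanged, so error-based splitting would stop where Gini keeps growing the tree. With a parent of class counts (7, 3) split into children (3, 0) and (4, 3):

```python
parent, children = [7, 3], [[3, 0], [4, 3]]
n = sum(parent)
weighted_error = sum(sum(c) / n * class_error(c) for c in children)
print(class_error(parent), weighted_error)  # both ≈ 0.30  -> error unchanged
print(gini(parent), gini_split(children))   # 0.42 vs ≈ 0.343 -> Gini improves
```
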
Example: C4.5
• Simple depth-first construction.
• Uses information gain.
• Sorts continuous attributes at each node.
• Needs entire data to fit in memory.
• Unsuitable for large datasets.
  – Needs out-of-core sorting.
• You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Scalable Decision Tree Induction / 1
• How scalable is decision tree induction?
  – Particularly suitable for small data sets
• SLIQ (EDBT’96 — Mehta et al.)
  – Builds an index for each attribute; only the class list and the current attribute list reside in memory
Scalable Decision Tree Induction / 2
• SLIQ

Sample data for the class buys_computer:

    RID | Credit_rating | Age | Buys_computer
    ----+---------------+-----+--------------
     1  | excellent     |  38 | yes
     2  | excellent     |  26 | yes
     3  | fair          |  35 | no
     4  | excellent     |  49 | no

Disk-resident attribute lists (one per attribute, pre-sorted):

    Credit_rating | RID        Age | RID
    --------------+----        ----+----
    excellent     |  1          26 |  2
    excellent     |  2          35 |  3
    excellent     |  4          38 |  1
    fair          |  3          49 |  4
    …             |  …           … |  …

Memory-resident class list:

    RID | Buys_computer | node
    ----+---------------+-----
     1  | yes           |  5
     2  | yes           |  2
     3  | no            |  3
     4  | no            |  6
     …  | …             |  …

[Figure: decision tree with node labels 0–6, referenced by the node column of the class list]
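
A minimal sketch of how SLIQ uses the structures above (the variable names are mine; in the real system the attribute lists stay on disk and are streamed):

```python
# Pre-sorted attribute lists, one per attribute: (value, RID) pairs.
# Disk-resident in SLIQ; only one list is scanned at a time.
attribute_lists = {
    "age":           [(26, 2), (35, 3), (38, 1), (49, 4)],
    "credit_rating": [("excellent", 1), ("excellent", 2),
                      ("excellent", 4), ("fair", 3)],
}

# Memory-resident class list: RID -> [class label, current tree node].
class_list = {1: ["yes", 5], 2: ["yes", 2], 3: ["no", 3], 4: ["no", 6]}

# Evaluating all candidate splits for one attribute is a sequential scan:
for value, rid in attribute_lists["age"]:
    label, node = class_list[rid]
    # ...update the class histogram kept at `node` for this candidate...
```

Because each attribute list is pre-sorted, the candidate splits for a numeric attribute fall out of one pass over its list, and only the class list has to fit in memory.
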
Decision Tree Based Classification
• Advantages
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many data sets
• Practical Issues of Classification
  – Underfitting and Overfitting
  – Missing Values
  – Costs of Classification