
Bina Nusantara University
Course: M0614 / Data Mining & OLAP
Year: Feb 2010
Classification and Prediction
Session 07
Learning Outcomes
By the end of this session, students are expected to be able to:
• apply the analysis techniques of classification by decision tree induction, Bayesian classification, classification by back propagation, and lazy learners in data mining. (C3)
Acknowledgments
These slides have been adapted from Han, J., Kamber, M., & Pei, J., Data Mining: Concepts and Techniques, and Tan, P.-N., Steinbach, M., & Kumar, V., Introduction to Data Mining.
Outline
• What is classification?
• What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data are classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification vs. Prediction
• Classification
  – predicts categorical class labels (discrete or nominal)
  – constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – Credit/loan approval
  – Medical diagnosis: is a tumor cancerous or benign?
  – Fraud detection: is a transaction fraudulent?
  – Web page categorization: which category does a page belong to?
Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Illustrating Classification Task

Training Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Flow: the Training Set is fed to a learning algorithm (induction), which learns a Model; the Model is then applied to the Test Set (deduction).
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set must be independent of the training set; otherwise over-fitting will occur
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (see the sketch below)
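To make the two-step process concrete, here is a minimal sketch assuming scikit-learn is available; the tiny numeric dataset, the 75/25 split, and the choice of DecisionTreeClassifier are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of the two-step process (model construction, then model usage),
# assuming scikit-learn is installed; the toy data below is made up for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy dataset: two numeric attributes and a binary class label.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1], [3, 0], [3, 1]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent test set,
# then classify new tuples whose class labels are unknown.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("prediction for [2, 1]:", model.predict([[2, 1]]))
```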
Process (1): Model Construction

Training Data:
  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

Classification algorithm → Classifier (Model):
  IF rank = ‘professor’ OR years > 6
  THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Testing Data:
  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured? Applying the learned rule, Jeff (rank = ‘professor’) is classified as tenured = ‘yes’ (see the sketch below).
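As a toy illustration, the learned rule can be written directly as a function; the function name predict_tenured and the lower-case rank strings are my own choices.

```python
# A sketch of the classifier learned on the previous slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    """Return 'yes' or 'no' according to the rule induced from the training data."""
    return "yes" if rank == "professor" or years > 6 else "no"

# Unseen data from the slide: (Jeff, Professor, 4) -> 'yes'
print(predict_tenured("professor", 4))
# Testing data: (Tom, Assistant Prof, 2) -> 'no'; (Joseph, Assistant Prof, 7) -> 'yes'
print(predict_tenured("assistant prof", 2), predict_tenured("assistant prof", 7))
```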
Issues: Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle missing
values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
Issues: Evaluating Classification Methods
• Accuracy
  – classifier accuracy: predicting class labels
  – predictor accuracy: estimating the values of predicted attributes
• Speed
  – time to construct the model (training time)
  – time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency on disk-resident databases
• Interpretability
  – understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Classification by Decision Tree Induction
• Decision tree
  – A flow-chart-like tree structure
  – Each internal node denotes a test on an attribute
  – Each branch represents an outcome of the test
  – Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
  – Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers
• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree
Example of a Decision Tree

Training Data:
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):
  Refund = Yes → NO
  Refund = No → MarSt:
    Married → NO
    Single, Divorced → TaxInc:
      < 80K → NO
      > 80K → YES
Another Example of a Decision Tree

Using the same training data (the 10 records above), a different tree also fits:
  MarSt = Married → NO
  MarSt = Single, Divorced → Refund:
    Yes → NO
    No → TaxInc:
      < 80K → NO
      > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Same framework as before (Training Set with Tid 1–10 and attributes Attrib1–Attrib3, Class; Test Set with Tid 11–15): the Training Set is fed to a Tree Induction algorithm (induction), which learns a Decision Tree model; the model is then applied to the Test Set (deduction).
Apply Model to Test Data

Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch matching the record's attribute value at each internal node:
  Refund = No → go to MarSt
  MarSt = Married → leaf NO

Assign Cheat to “No” (see the sketch below).
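A minimal sketch of applying this tree programmatically; the function name and the treatment of “80K” as the number 80 are illustrative assumptions.

```python
# The decision tree from the slides, hard-coded as nested conditionals.
def classify_cheat(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"                      # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                      # MarSt = Married -> leaf NO
    # Single or Divorced: test Taxable Income (here, values below 80K go left)
    return "No" if taxable_income < 80 else "Yes"

# Test record from the slide: Refund = No, Married, Taxable Income = 80K
print(classify_cheat("No", "Married", 80))   # -> No
```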
Decision Tree Classification Task
(Same framework figure as before: Training Set → Tree Induction algorithm → Decision Tree model → Apply Model to the Test Set.)
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• General procedure (see the sketch below):
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(The running example is the 10-record Refund / Marital Status / Taxable Income / Cheat training set shown earlier; Dt is the subset of those records reaching node t.)
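Below is a minimal runnable sketch of this general procedure. The split-attribute choice (the first attribute with more than one distinct value) is a placeholder; real decision tree learners choose the test that optimizes an impurity measure, as discussed on the following slides. The four-record dataset is an illustrative assumption.

```python
# A minimal sketch of Hunt's general procedure. Records are dicts; the class label
# is stored under the key given by `target`.
from collections import Counter

def hunt(records, attributes, target, default=None):
    if not records:                                   # empty Dt -> leaf with default class yd
        return default
    classes = {r[target] for r in records}
    if len(classes) == 1:                             # all records in the same class yt -> leaf yt
        return classes.pop()
    majority = Counter(r[target] for r in records).most_common(1)[0][0]
    for attr in attributes:                           # find an attribute that actually splits Dt
        if len({r[attr] for r in records}) > 1:
            remaining = [a for a in attributes if a != attr]
            return (attr, {v: hunt([r for r in records if r[attr] == v],
                                   remaining, target, default=majority)
                           for v in {r[attr] for r in records}})
    return majority                                   # no attribute splits Dt -> majority-class leaf

data = [
    {"Refund": "Yes", "MarSt": "Single",   "Cheat": "No"},
    {"Refund": "No",  "MarSt": "Married",  "Cheat": "No"},
    {"Refund": "No",  "MarSt": "Single",   "Cheat": "Yes"},
    {"Refund": "No",  "MarSt": "Divorced", "Cheat": "Yes"},
]
print(hunt(data, ["Refund", "MarSt"], "Cheat"))
```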
Hunt’s Algorithm (applied to the Cheat training data)
  Step 1: a single leaf, Don’t Cheat (the majority class)
  Step 2: split on Refund — Yes → Don’t Cheat; No → Don’t Cheat
  Step 3: under Refund = No, split on Marital Status — Married → Don’t Cheat; Single, Divorced → Cheat
  Step 4: under Single, Divorced, split on Taxable Income — < 80K → Don’t Cheat; >= 80K → Cheat
Tree Induction
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values.
  CarType → {Family}, {Sports}, {Luxury}
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values.
  Size → {Small}, {Medium}, {Large}
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}
• What about the split {Small, Large} vs. {Medium}? (It does not respect the order of the ordinal values.)
Splitting Based on Continuous Attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • considers all possible splits and finds the best cut
    • can be more compute intensive
Splitting Based on Continuous Attributes
  (i) Binary split: Taxable Income > 80K? → Yes / No
  (ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
Tree Induction (recap)
• Greedy strategy: split the records on the attribute test that optimizes the chosen criterion.
• Next issue: how to determine the best split.
How to determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1. Three candidate test conditions:
  Own Car? — Yes: C0=6, C1=4; No: C0=4, C1=6
  Car Type? — Family: C0=1, C1=3; Sports: C0=8, C1=0; Luxury: C0=1, C1=7
  Student ID? — c1 … c10: C0=1, C1=0 each; c11 … c20: C0=0, C1=1 each
Which test condition is the best?
How to determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
  C0=5, C1=5: non-homogeneous, high degree of impurity
  C0=9, C1=1: homogeneous, low degree of impurity
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
How to Find the Best Split
Before splitting: class counts C0 = N00, C1 = N01, with impurity M0.
Candidate attribute A splits the node into N1 (C0=N10, C1=N11) and N2 (C0=N20, C1=N21), with impurities M1 and M2 combining to M12.
Candidate attribute B splits the node into N3 (C0=N30, C1=N31) and N4 (C0=N40, C1=N41), with impurities M3 and M4 combining to M34.
Gain = M0 − M12 versus M0 − M34: choose the test with the larger gain.
Measure of Impurity: GINI
• Gini index for a given node t:
  GINI(t) = 1 − Σ_j [p(j|t)]²
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
  – Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information
  – Minimum (0.0) when all records belong to one class, implying the most interesting information

  C1=0, C2=6: Gini = 0.000    C1=1, C2=5: Gini = 0.278    C1=2, C2=4: Gini = 0.444    C1=3, C2=3: Gini = 0.500
Examples for computing GINI
  GINI(t) = 1 − Σ_j [p(j|t)]²

  C1=0, C2=6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Gini = 1 − 0² − 1² = 0
  C1=1, C2=5: P(C1) = 1/6, P(C2) = 5/6 → Gini = 1 − (1/6)² − (5/6)² = 0.278
  C1=2, C2=4: P(C1) = 2/6, P(C2) = 4/6 → Gini = 1 − (2/6)² − (4/6)² = 0.444
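A small sketch that reproduces these Gini values from per-class counts (plain Python, no external libraries):

```python
# GINI(t) computed from the per-class record counts at a node.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(f"C1={c1}, C2={c2}: Gini = {gini([c1, c2]):.3f}")
# C1=0, C2=6: 0.000   C1=1, C2=5: 0.278   C1=2, C2=4: 0.444   C1=3, C2=3: 0.500
```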
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as
  GINI_split = Σ_{i=1..k} (n_i / n) × GINI(i)
  where n_i is the number of records at child i and n is the number of records at node p.
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought.

  Parent: C1=6, C2=6, Gini = 0.500
  Split on B?: Node N1 (Yes): C1=5, C2=2; Node N2 (No): C1=1, C2=4

  Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
  Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
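A sketch of the weighted (children) Gini for this split, using the node sizes 7 and 5 from the counts above:

```python
# Weighted Gini of a binary split, from per-partition class counts.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: list of per-class count lists, one per partition."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

n1, n2 = [5, 2], [1, 4]                           # Node N1: C1=5, C2=2; Node N2: C1=1, C2=4
print(round(gini(n1), 3), round(gini(n2), 3))     # 0.408 0.32
print(round(gini_split([n1, n2]), 3))             # 0.371
```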
Categorical Attributes: Computing Gini Index
• For each distinct value, gather the counts for each class in the dataset
• Use the count matrix to make decisions

  Multi-way split (CarType):
    Family: C1=1, C2=4; Sports: C1=2, C2=1; Luxury: C1=1, C2=1 → Gini = 0.393
  Two-way splits (find the best partition of values):
    {Sports, Luxury}: C1=3, C2=2; {Family}: C1=1, C2=4 → Gini = 0.400
    {Sports}: C1=2, C2=1; {Family, Luxury}: C1=2, C2=5 → Gini = 0.419
Continuous Attributes: Computing Gini Index
• Use binary decisions based on one value (e.g., Taxable Income > 80K? → Yes / No)
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.
(The running example is again the 10-record Refund / Marital Status / Taxable Income / Cheat training set.)
Continuous Attributes: Computing Gini Index...
• For efficient computation, for each attribute:
  – Sort the attribute values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index

  Sorted Taxable Income values (with Cheat labels):
    60 (No), 70 (No), 75 (No), 85 (Yes), 90 (Yes), 95 (Yes), 100 (No), 120 (No), 125 (No), 220 (No)
  Candidate split positions and their Gini values:
    55 → 0.420, 65 → 0.400, 72 → 0.375, 80 → 0.343, 87 → 0.417, 92 → 0.400, 97 → 0.300,
    110 → 0.343, 122 → 0.375, 172 → 0.400, 230 → 0.420
  Best split: Taxable Income ≤ 97 (Gini = 0.300). (See the sketch below.)
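A sketch of the sorted scan on this example. For clarity it recomputes the class counts at every cut instead of updating them incrementally, and it uses exact midpoints as cut points (the slide lists rounded cut values), so it is less efficient than the procedure described above but finds the same best split.

```python
# Find the best binary split on Taxable Income by scanning sorted values.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

records = [(60, "No"), (70, "No"), (75, "No"), (85, "Yes"), (90, "Yes"),
           (95, "Yes"), (100, "No"), (120, "No"), (125, "No"), (220, "No")]
records.sort()
values = [v for v, _ in records]
labels = [c for _, c in records]

best = None
for i in range(len(values) - 1):
    cut = (values[i] + values[i + 1]) / 2            # midpoint between adjacent values
    left, right = labels[:i + 1], labels[i + 1:]
    w = (len(left) * gini([left.count("Yes"), left.count("No")]) +
         len(right) * gini([right.count("Yes"), right.count("No")])) / len(labels)
    if best is None or w < best[1]:
        best = (cut, w)
    print(f"cut <= {cut:6.1f}: weighted Gini = {w:.3f}")

print("best split:", best)   # Taxable Income <= 97.5, weighted Gini = 0.300
```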
Alternative Splitting Criteria based on INFO
• Entropy at a given node t:
  Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
  – Measures the homogeneity of a node.
    • Maximum (log₂ n_c) when records are equally distributed among all classes, implying least information
    • Minimum (0.0) when all records belong to one class, implying most information
  – Entropy-based computations are similar to the GINI index computations
Examples for computing Entropy
  Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

  C1=0, C2=6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Entropy = −0 log₂ 0 − 1 log₂ 1 = 0
  C1=1, C2=5: P(C1) = 1/6, P(C2) = 5/6 → Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
  C1=2, C2=4: P(C1) = 2/6, P(C2) = 4/6 → Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
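A small sketch reproducing these entropy values from per-class counts:

```python
# Entropy(t) computed from the per-class record counts at a node.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

for c1, c2 in [(0, 6), (1, 5), (2, 4)]:
    print(f"C1={c1}, C2={c2}: Entropy = {entropy([c1, c2]):.2f}")
# 0.00, 0.65, 0.92
```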
Splitting Based on INFO...
• Information Gain:
  GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) × Entropy(i)
  where parent node p is split into k partitions and n_i is the number of records in partition i.
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN). (See the sketch below.)
  – Used in ID3 and C4.5
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
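A sketch of the information gain of a split; the parent and children counts below are made-up illustrative numbers, not taken from the slides.

```python
# GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Parent node: 10 records of C1 and 10 of C2, split into two children.
print(round(info_gain([10, 10], [[7, 3], [3, 7]]), 3))   # 0.119
```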
Splitting Based on INFO...
• Gain Ratio:
  GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = − Σ_{i=1..k} (n_i / n) log₂(n_i / n)
  and parent node p is split into k partitions with n_i records in partition i.
  – Adjusts the information gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
  – Used in C4.5
  – Designed to overcome the disadvantage of information gain
Splitting Criteria based on Classification Error
• Classification error at a node t:
  Error(t) = 1 − max_i P(i|t)
• Measures the misclassification error made by a node.
  – Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information
  – Minimum (0.0) when all records belong to one class, implying the most interesting information
Examples for Computing Error
  Error(t) = 1 − max_i P(i|t)

  C1=0, C2=6: P(C1) = 0, P(C2) = 1 → Error = 1 − max(0, 1) = 0
  C1=1, C2=5: P(C1) = 1/6, P(C2) = 5/6 → Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
  C1=2, C2=4: P(C1) = 2/6, P(C2) = 4/6 → Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
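A small sketch reproducing these error values:

```python
# Error(t) = 1 - max_i P(i|t), computed from per-class counts.
def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

for c1, c2 in [(0, 6), (1, 5), (2, 4)]:
    print(f"C1={c1}, C2={c2}: Error = {classification_error([c1, c2]):.3f}")
# 0.000, 0.167 (= 1/6), 0.333 (= 1/3)
```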
Comparison among Splitting Criteria
For a 2-class problem: Misclassification Error vs. Gini

  Parent: C1=7, C2=3; Gini = 0.420
  Split on A?: Node N1 (Yes): C1=3, C2=0; Node N2 (No): C1=4, C2=3
  Gini(N1) = 1 − (3/3)² − (0/3)² = 0
  Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
  Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342 < 0.420 → Gini improves!
  Misclassification error of the children = 3/10 × 0 + 7/10 × (3/7) = 0.300, the same as the parent's error, so the error measure does not reward this split.
Tree Induction (recap)
• Greedy strategy: split the records on the attribute test that optimizes the chosen criterion.
• Remaining issue: determine when to stop splitting.
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same
class
• Stop expanding a node when all the records have similar attribute
values
• Early termination (to be discussed later)
Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = − Σ_{i=1..m} p_i log₂(p_i)
• Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × I(D_j)
• Information gained by branching on attribute A:
  Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain
• Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)

  Info(D) = I(9,5) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940

  Training data:
   age    income  student  credit_rating  buys_computer
   <=30   high    no       fair           no
   <=30   high    no       excellent      no
   31…40  high    no       fair           yes
   >40    medium  no       fair           yes
   >40    low     yes      fair           yes
   >40    low     yes      excellent      no
   31…40  low     yes      excellent      yes
   <=30   medium  no       fair           no
   <=30   low     yes      fair           yes
   >40    medium  yes      fair           yes
   <=30   medium  yes      excellent      yes
   31…40  medium  no       excellent      yes
   31…40  high    yes      fair           yes
   >40    medium  no       excellent      no

  Partitioning on age:
   age     p_i  n_i  I(p_i, n_i)
   <=30    2    3    0.971
   31…40   4    0    0
   >40     3    2    0.971

  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  ((5/14) I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.)

  Gain(age) = Info(D) − Info_age(D) = 0.246
  Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
  (See the sketch below.)
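A sketch that recomputes these quantities directly from the 14-tuple table; the last digit of the printed gains can differ slightly from the slide because the slide rounds intermediate values.

```python
# Recompute Info(D) and Gain(A) for each attribute of the buys_computer table.
from math import log2
from collections import Counter, defaultdict

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# (age, income, student, credit_rating, buys_computer)
D = [("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
     ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
     (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
     ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
     ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
     ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
     ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no")]

labels = [t[-1] for t in D]
print("Info(D) =", round(info(labels), 3))                      # 0.940

for idx, name in enumerate(["age", "income", "student", "credit_rating"]):
    groups = defaultdict(list)
    for t in D:
        groups[t[idx]].append(t[-1])
    info_a = sum(len(g) / len(D) * info(g) for g in groups.values())
    print(f"Gain({name}) =", round(info(labels) - info_a, 3))
# Gain(age) ~ 0.247, Gain(income) ~ 0.029, Gain(student) ~ 0.152, Gain(credit_rating) ~ 0.048
```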
Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
  – Sort the values of A in increasing order
  – Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    • (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
  – The point with the minimum expected information requirement for A is selected as the split point for A
• Split:
  – D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) × log₂(|D_j| / |D|)
  – GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex. (income partitions D into 4 “high”, 6 “medium”, and 4 “low” tuples):
  SplitInfo_income(D) = −(4/14) log₂(4/14) − (6/14) log₂(6/14) − (4/14) log₂(4/14) = 1.557
  – gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute (see the sketch below)
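A sketch of the SplitInfo and gain ratio computation for income, assuming the partition sizes 4, 6, and 4 from the training table and Gain(income) = 0.029 from the earlier slide:

```python
# SplitInfo over the partition sizes, then the gain ratio for income.
from math import log2

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes)

si = split_info([4, 6, 4])
print(round(si, 3))             # 1.557
print(round(0.029 / si, 3))     # 0.019
```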
Comparing Attribute Selection Measures
• The three measures, in general, return good results, but
  – Information gain:
    • biased towards multivalued attributes
  – Gain ratio:
    • tends to prefer unbalanced splits in which one partition is much smaller than the others
  – Gini index:
    • biased towards multivalued attributes
    • has difficulty when the number of classes is large
    • tends to favor tests that result in equal-sized partitions and purity in both partitions
Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for
many simple data sets
Classification in Large Databases
• Classification: a classical problem extensively studied by statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
• Why decision tree induction in data mining?
  – relatively fast learning speed (compared with other classification methods)
  – convertible to simple and easy-to-understand classification rules
  – can use SQL queries for accessing databases
  – comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods
• SLIQ
  – Builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT
  – Constructs an attribute-list data structure
• PUBLIC
  – Integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest
  – Builds an AVC-list (attribute, value, class label)
• BOAT
  – Uses bootstrapping to create several small samples
Scalability Framework for RainForest
• Separates the scalability aspects from the criteria that determine the quality of the tree
• Builds an AVC-list: AVC (Attribute, Value, Class_label)
• AVC-set (of an attribute X)
  – Projection of the training dataset onto the attribute X and the class label, where counts of the individual class labels are aggregated
• AVC-group (of a node n)
  – Set of AVC-sets of all predictor attributes at the node n
RainForest: Training Set and Its AVC-Sets
  Training examples: the same 14-tuple table (age, income, student, credit_rating, buys_computer) shown earlier.

  AVC-set on age:
   age     buys_computer = yes  no
   <=30    2                    3
   31…40   4                    0
   >40     3                    2

  AVC-set on income:
   income  yes  no
   high    2    2
   medium  4    2
   low     3    1

  AVC-set on student:
   student  yes  no
   yes      6    1
   no       3    4

  AVC-set on credit_rating:
   credit_rating  yes  no
   fair           6    2
   excellent      3    3

  (See the sketch below.)
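A sketch of building AVC-sets in one pass over the training tuples; for brevity the data below is a shortened, hand-copied subset of the table rather than all 14 rows.

```python
# Build AVC-sets: per attribute, per value, a count of each class label.
from collections import defaultdict

def avc_sets(tuples, attributes, target):
    """Return {attribute: {value: {class_label: count}}}."""
    avc = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}
    for row in tuples:
        for a in attributes:
            avc[a][row[a]][row[target]] += 1
    return avc

data = [
    {"age": "<=30",   "student": "no",  "buys_computer": "no"},
    {"age": "<=30",   "student": "no",  "buys_computer": "no"},
    {"age": "31..40", "student": "no",  "buys_computer": "yes"},
    {"age": ">40",    "student": "yes", "buys_computer": "yes"},
]
for attr, table in avc_sets(data, ["age", "student"], "buys_computer").items():
    print(attr, {v: dict(c) for v, c in table.items()})
```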
Data Cube-Based Decision-Tree Induction
• Integration of generalization with decision-tree induction
• Classification at primitive concept levels
  – E.g., precise temperature, humidity, outlook, etc.
  – Low-level concepts, scattered classes, bushy classification trees
  – Semantic interpretation problems
• Cube-based multi-level classification
  – Relevance analysis at multiple levels
  – Information-gain analysis with dimension + level
Presentation of Classification Results
Continued in Session 08: Classification and Prediction (cont.)