Data Mining:
Concepts and Techniques
— Chapter 6 —
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Chapter 6. Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by back propagation
- Support Vector Machines (SVM)
- Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- Prediction
- Accuracy and error measures
- Ensemble methods
- Model selection
- Summary
Classification vs. Prediction

- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit approval
  - Target marketing
  - Medical diagnosis
  - Fraud detection
Classification—A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction

The training data are fed to a classification algorithm, which produces a classifier (model).

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Process (2): Using the Model in Prediction

The classifier is applied to test data to estimate its accuracy, and then to unseen data.

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) -> Tenured?
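A minimal sketch of the two steps on the toy tenure data above. scikit-learn is an assumption here; the slides do not prescribe any library, and the exact tree it learns depends on its split choices.

from sklearn.tree import DecisionTreeClassifier

rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

# Step 1: model construction from the training set (rank, years) -> tenured
train = [("Assistant Prof", 3, "no"), ("Assistant Prof", 7, "yes"),
         ("Professor", 2, "yes"), ("Associate Prof", 7, "yes"),
         ("Assistant Prof", 6, "no"), ("Associate Prof", 3, "no")]
X = [[rank_code[r], yrs] for r, yrs, _ in train]
y = [label for _, _, label in train]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Step 2: model usage on the unseen tuple (Jeff, Professor, 4)
print(model.predict([[rank_code["Professor"], 4]]))
# The hand-written rule on the previous slide would answer "yes"; the learned
# tree may or may not agree, depending on the splits it chose.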
Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues: Data Preparation

- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
Issues: Evaluating Classification Methods

- Accuracy
  - classifier accuracy: predicting the class label
  - predictor accuracy: guessing the value of the predicted attribute
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
Another Example

A different tree fitted to the same training data:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification

Induction: a tree induction algorithm learns a model (a decision tree) from the training set.
Deduction: the learned model is applied to the test set to assign a class to each tuple.

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree:

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

Refund = No, so follow the "No" branch; MarSt = Married, so the tuple reaches the leaf NO.
Assign Cheat to "No".
Decision Tree Induction

- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT
Decision Tree Induction: Training Dataset

This follows an example of Quinlan's ID3 (Playing Tennis).

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Output: A Decision Tree for "buys_computer"

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  - There are no samples left
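A minimal Python sketch of this greedy algorithm for categorical attributes. The names build_tree and select_attribute and the dict-based tuple representation are illustrative assumptions rather than the book's pseudocode; select_attribute stands in for any attribute-selection measure (e.g., information gain, defined next).

from collections import Counter

def build_tree(tuples, attributes, select_attribute):
    labels = [t["class"] for t in tuples]
    # Stop: all samples for this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting for the leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(tuples, attributes)          # heuristic measure
    node = {"attribute": best, "branches": {}}
    # Partition the examples on each value of the selected attribute
    for value in set(t[best] for t in tuples):
        part = [t for t in tuples if t[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(part, remaining, select_attribute)
    return node
# (Branches are created only for values that actually occur, so the
# "no samples left" case does not arise in this simplified sketch.)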
Attribute Selection Measure:
Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
- Expected information (entropy) needed to classify a tuple in D:

  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

- Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × I(D_j)

- Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

  Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

- Distribution of the classes over age:

  age    p_i  n_i  I(p_i, n_i)
  <=30   2    3    0.971
  31…40  4    0    0
  >40    3    2    0.971

  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  (5/14) I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

  Gain(age) = Info(D) - Info_age(D) = 0.246

- Similarly,

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
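A plain-Python sketch (no external libraries assumed) that reproduces these numbers from the buys_computer training set:

from math import log2
from collections import Counter

# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(tuples):
    """Expected information (entropy) of the class distribution."""
    counts = Counter(t[-1] for t in tuples)
    n = len(tuples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(tuples, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    idx, n = ATTRS[attr], len(tuples)
    info_a = 0.0
    for value in set(t[idx] for t in tuples):
        part = [t for t in tuples if t[idx] == value]
        info_a += len(part) / n * info(part)
    return info(tuples) - info_a

print(round(info(data), 3))               # 0.940
for a in ATTRS:                           # age 0.246, income 0.029,
    print(a, round(gain(data, a), 3))     # student ~0.151, credit_rating 0.048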
Computing Information-Gain for Continuous-Valued Attributes

- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split:
  - D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)

- The information gain measure is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):

  SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

- Ex. For income (4 "low", 6 "medium", and 4 "high" tuples):

  SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

  gain_ratio(income) = 0.029 / 1.557 = 0.019

- The attribute with the maximum gain ratio is selected as the splitting attribute
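A small follow-up sketch for SplitInfo and the gain ratio of income, using the counts from the example (4 low, 6 medium, 4 high tuples):

from math import log2

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes if s)

si = split_info([4, 6, 4])
print(round(si, 3))           # 1.557
print(round(0.029 / si, 3))   # gain_ratio(income) = 0.019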
Gini Index (CART, IBM IntelligentMiner)

- If a data set D contains examples from n classes, the gini index, gini(D), is defined as

  gini(D) = 1 - Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

- Reduction in impurity:

  Δgini(A) = gini(D) - gini_A(D)

- The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Gini Index (CART, IBM IntelligentMiner): Example

- Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

  gini(D) = 1 - (9/14)² - (5/14)² = 0.459

- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

  but gini_{medium,high} is 0.30 and thus the best, since it is the lowest
- All attributes are assumed continuous-valued
- May need other tools, e.g., clustering, to get the possible split values
- Can be modified for categorical attributes
Examples for Computing Gini

GINI(t) = 1 - Σ_j [p(j|t)]²

  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1;  Gini = 1 - 0 - 1 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6;          Gini = 1 - (1/6)² - (5/6)² = 0.278
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6;          Gini = 1 - (2/6)² - (4/6)² = 0.444
  C1 = 3, C2 = 3:  P(C1) = 3/6, P(C2) = 3/6;          Gini = 1 - (3/6)² - (3/6)² = 0.5
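A quick plain-Python check of these node Gini values:

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5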
Splitting Based on Gini

- When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)

  where n_i = number of records at child i, and n = number of records at node p
Binary Attributes: Computing Gini Index

- Splits into two partitions
- Effect of weighing partitions: larger and purer partitions are sought

  Parent: C1 = 6, C2 = 6, Gini = 0.500

  Split on B?   N1 (Yes)  N2 (No)
  C1            5         1
  C2            2         4

  Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
  Gini(N2) = 1 - (1/5)² - (4/5)² = 0.32
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.32 = 0.371
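A short sketch reproducing the weighted Gini of this split:

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

n1, n2 = (5, 2), (1, 4)                   # class counts (C1, C2) at N1 and N2
total = sum(n1) + sum(n2)
weighted = sum(n1) / total * gini(n1) + sum(n2) / total * gini(n2)
print(round(gini(n1), 3), round(gini(n2), 3), round(weighted, 3))   # 0.408 0.32 0.371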
Categorical Attributes: Computing Gini Index

- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions

  Count matrices for the attribute CarType:

  Multi-way split:

      Family  Sports  Luxury
  C1  1       2       1
  C2  4       1       1
  Gini = 0.393

  Two-way splits (find the best partition of values):

      {Sports, Luxury}  {Family}
  C1  3                 1
  C2  2                 4
  Gini = 0.400

      {Family, Luxury}  {Sports}
  C1  2                 2
  C2  5                 1
  Gini = 0.419
Continuous Attributes: Computing Gini Index

- Use binary decisions based on one value (e.g., Taxable Income > 80K? Yes/No, on the training data shown earlier)
- Several choices for the splitting value
  - Number of possible splitting values = number of distinct values
- Each splitting value has a count matrix associated with it
  - Class counts in each of the partitions, A < v and A ≥ v
Continuous Attributes: Computing Gini Index (cont.)

- For each continuous attribute:
  - Sort the attribute on values
  - Choose a split position midway between any two adjacent values
  - Linearly scan these values, each time updating the count matrix and computing the gini index
  - Choose the split position that has the least gini index
- Example on Taxable Income (class Cheat):

  Sorted values:  60  70  75  85  90  95  100  120  125  220
  Cheat:          No  No  No  Yes Yes Yes No   No   No   No

  Split v   Yes (<=v / >v)   No (<=v / >v)   Gini
  55        0 / 3            0 / 7           0.420
  65        0 / 3            1 / 6           0.400
  72        0 / 3            2 / 5           0.375
  80        0 / 3            3 / 4           0.343
  87        1 / 2            3 / 4           0.417
  92        2 / 1            3 / 4           0.400
  97        3 / 0            3 / 4           0.300
  110       3 / 0            4 / 3           0.343
  122       3 / 0            5 / 2           0.375
  172       3 / 0            6 / 1           0.400
  230       3 / 0            7 / 0           0.420

  The best split position is 97, with Gini = 0.300.
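A sketch of this linear scan on the Taxable Income column. It uses the exact midpoints between adjacent sorted values as candidates, whereas the slide rounds them down to integers and also lists the two boundary points:

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

def gini(labels):
    if not labels:
        return 0.0
    p_yes = labels.count("Yes") / len(labels)
    return 1.0 - p_yes ** 2 - (1.0 - p_yes) ** 2

pairs = sorted(zip(incomes, cheat))
values = [v for v, _ in pairs]
# Candidate split positions: midpoints between adjacent sorted values
candidates = [(values[i] + values[i + 1]) / 2 for i in range(len(values) - 1)]

best = None
for split in candidates:
    left = [c for v, c in pairs if v <= split]
    right = [c for v, c in pairs if v > split]
    g = len(left) / len(pairs) * gini(left) + len(right) / len(pairs) * gini(right)
    if best is None or g < best[1]:
        best = (split, g)

print(best)   # (97.5, 0.3) -- the same split the table above reports as 97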
Comparing Attribute Selection Measures

- The three measures, in general, return good results, but:
  - Information gain: biased towards multivalued attributes
  - Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  - Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions
Overfitting and Tree Pruning

- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if doing so would cause the goodness measure to fall below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"
Enhancements to Basic Decision Tree Induction

- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
Classification in Large Databases

- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods

- SLIQ (EDBT'96, Mehta et al.)
  - Builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.)
  - Constructs an attribute list data structure
- PUBLIC (VLDB'98, Rastogi & Shim)
  - Integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  - Builds an AVC-list (attribute, value, class label)
- BOAT (PODS'99, Gehrke, Ganti, Ramakrishnan & Loh)
  - Uses bootstrapping to create several small samples
Using IF-THEN Rules for Classification

- Represent the knowledge in the form of IF-THEN rules

  R: IF age = youth AND student = yes THEN buys_computer = yes

  - Rule antecedent/precondition vs. rule consequent
- Assessment of a rule: coverage and accuracy (see the sketch below)
  - n_covers = # of tuples covered by R
  - n_correct = # of tuples correctly classified by R
  - coverage(R) = n_covers / |D|    (D: training data set)
  - accuracy(R) = n_correct / n_covers
- If more than one rule is triggered, conflict resolution is needed
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., the most attribute tests)
  - Class-based ordering: decreasing order of prevalence or misclassification cost per class
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
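A minimal sketch of coverage and accuracy for rule R. The training set D below is a small hypothetical example, not taken from the slides:

D = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "no"},
]

def antecedent(t):
    # IF age = youth AND student = yes
    return t["age"] == "youth" and t["student"] == "yes"

covers = [t for t in D if antecedent(t)]
correct = [t for t in covers if t["buys_computer"] == "yes"]

coverage = len(covers) / len(D)         # n_covers / |D|
accuracy = len(correct) / len(covers)   # n_correct / n_covers
print(coverage, accuracy)               # 0.5 0.5 on this toy D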
Rule Extraction from a Decision Tree

- Rules are easier to understand than large trees
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
- Rules are mutually exclusive and exhaustive
- Example: rule extraction from our buys_computer decision tree (age? at the root, then student? and credit_rating?):

  IF age = young AND student = no             THEN buys_computer = no
  IF age = young AND student = yes            THEN buys_computer = yes
  IF age = mid-age                            THEN buys_computer = yes
  IF age = old AND credit_rating = excellent  THEN buys_computer = no
  IF age = old AND credit_rating = fair       THEN buys_computer = yes
Rule Extraction from the Training Data

- Sequential covering algorithm: extracts rules directly from the training data
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
- Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
- Steps:
  - Rules are learned one at a time
  - Each time a rule is learned, the tuples covered by the rule are removed
  - The process repeats on the remaining tuples until a termination condition is met, e.g., when there are no more training examples or when the quality of a returned rule is below a user-specified threshold
- Contrast with decision-tree induction, which learns a set of rules simultaneously
Sequential Covering Algorithm

while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

Each learned rule covers a region of the positive examples (Rule 1, Rule 2, Rule 3, ...).
How to Learn-One-Rule?

- Start with the most general rule possible: condition = empty
- Add new attribute tests by adopting a greedy depth-first strategy
  - Pick the one that most improves the rule quality
- Rule-quality measures: consider both coverage and accuracy
  - FOIL-gain (in FOIL and RIPPER): assesses the information gained by extending the condition:

    FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) - log2(pos / (pos + neg)))

    It favors rules that have high accuracy and cover many positive tuples
- Rule pruning based on an independent set of test tuples:

    FOIL_Prune(R) = (pos - neg) / (pos + neg)

  pos/neg are the numbers of positive/negative tuples covered by R.
  If FOIL_Prune is higher for the pruned version of R, prune R.
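A small sketch of the FOIL-gain formula above; the example counts at the end are hypothetical:

from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    if pos_new == 0:
        return 0.0   # the extended rule covers no positive tuples
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

# R covers 10 positives and 10 negatives; adding a test keeps 8 positives
# and only 2 negatives.
print(round(foil_gain(10, 10, 8, 2), 3))   # 8 * (log2(0.8) - log2(0.5)) = 5.425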
Rule Generation

To generate a rule:

while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to the current rule
    else break

Example of a rule being grown over the positive and negative examples: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5.
Lazy vs. Eager Learning

- Lazy vs. eager learning
  - Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  - Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
- Lazy: less time in training but more time in predicting
- Accuracy
  - A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  - Eager: must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods

- Instance-based learning:
  - Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach
    - Instances represented as points in a Euclidean space
  - Locally weighted regression
    - Constructs a local approximation
  - Case-based reasoning
    - Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm

- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
- The target function could be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

  (Figure: positive and negative training points around a query point xq, with the Voronoi cells induced by 1-NN.)
Discussion on the k-NN Algorithm

- k-NN for real-valued prediction for a given unknown tuple
  - Returns the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm
  - Weight the contribution of each of the k neighbors according to its distance to the query xq, giving greater weight to closer neighbors:

    w = 1 / d(xq, xi)²

- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
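A minimal distance-weighted k-NN classification sketch in plain Python; the training points are a hypothetical toy set, not from the slides:

from math import dist
from collections import defaultdict

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.3), "+"),
         ((3.0, 3.2), "-"), ((2.9, 3.0), "-")]

def knn_classify(xq, k=3):
    # The k training points nearest to the query, by Euclidean distance
    neighbors = sorted(train, key=lambda p: dist(p[0], xq))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, xq)
        # Distance-weighted vote: w = 1 / d(xq, xi)^2
        votes[label] += 1.0 / (d * d) if d > 0 else float("inf")
    return max(votes, key=votes.get)

print(knn_classify((1.1, 1.0)))   # '+' -- all three nearest neighbors are '+'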
Case-Based Reasoning (CBR)

- CBR: uses a database of problem solutions to solve new problems
- Stores symbolic descriptions (tuples or cases), not points in a Euclidean space
- Applications: customer service (product-related diagnosis), legal rulings
- Methodology
  - Instances represented by rich symbolic descriptions (e.g., function graphs)
  - Search for similar cases; multiple retrieved cases may be combined
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Challenges
  - Find a good similarity metric
  - Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases