Chapter 2
Data Mining: A Closer Look
2.1 Data Mining Strategies
[Figure: A hierarchy of data mining strategies — unsupervised clustering, supervised learning, and market basket analysis; supervised learning divides into classification, estimation, and prediction.]
Supervised Learning
• Output attributes: dependent variables
• Input attributes: independent variables
Classification: Characteristics
• Learning is supervised.
• The dependent variable is categorical.
• Well-defined classes.
• Current rather than future behavior.
Classification: Examples
• Whether an individual has suffered a heart attack
• The profile of a successful person
• Whether a credit card purchase is fraudulent
• Whether a car loan applicant is a good or poor credit risk
Estimation: Characteristics
• Learning is supervised.
• The dependent variable is numeric.
• Well-defined classes.
• Current rather than future behavior.
Estimation: Examples
• The future location of a thunderstorm
• The salary of a sports car owner
• The likelihood that a credit card has been stolen
Estimation problems can often be transformed into classification problems: estimate a probability, then apply a threshold to assign a class.
Prediction
• The emphasis is on predicting future rather than current outcomes.
• The output attribute may be categorical or numeric.
Prediction: Examples
• The total number of touchdowns an NFL running back will score
• Whether customers will take advantage of a special offer made available with their credit card billing
• Future stock price
• Which telephone subscribers are likely to change providers
The Cardiology Patient Dataset
Table 2.1 • Cardiology Patient Data

Attribute Name            | Mixed Values                                  | Numeric Values | Comments
Age                       | Numeric                                       | Numeric        | Age in years
Sex                       | Male, Female                                  | 1, 0           | Patient gender
Chest Pain Type           | Angina, Abnormal Angina, NoTang, Asymptomatic | 1–4            | NoTang = Nonanginal pain
Blood Pressure            | Numeric                                       | Numeric        | Resting blood pressure upon hospital admission
Cholesterol               | Numeric                                       | Numeric        | Serum cholesterol
Fasting Blood Sugar < 120 | True, False                                   | 1, 0           | Is fasting blood sugar less than 120?
Resting ECG               | Normal, Abnormal, Hyp                         | 0, 1, 2        | Hyp = Left ventricular hypertrophy
Table 2.1 • Cardiology Patient Data (continued)

Attribute Name            | Mixed Values     | Numeric Values | Comments
Maximum Heart Rate        | Numeric          | Numeric        | Maximum heart rate achieved
Induced Angina?           | True, False      | 1, 0           | Does the patient experience angina as a result of exercise?
Old Peak                  | Numeric          | Numeric        | ST depression induced by exercise relative to rest
Slope                     | Up, Flat, Down   | 1–3            | Slope of the peak exercise ST segment
Number of Colored Vessels | 0, 1, 2, 3       | 0, 1, 2, 3     | Number of major vessels colored by fluoroscopy
Thal                      | Normal, Fix, Rev | 3, 6, 7        | Normal, fixed defect, reversible defect
Concept Class             | Healthy, Sick    | 1, 0           | Angiographic disease status
Table 2.2 • Most and Least Typical Instances from the Cardiology Domain
Attribute
Name
Age
Sex
Chest Pain Type
Blood Pressure
Cholesterol
Fasting Blood Sugar < 120
Resting ECG
Maximum Heart Rate
Induced Angina?
Old Peak
Slope
Number of Colored Vessels
Thal
Most Typical Least Typical Most Typical
Healthy
Healthy Class Sick Class
Class
52
Male
NoTang
138
223
False
Normal
169
False
0
Up
0
Normal
63
Male
Angina
145
233
True
Hyp
150
False
2.3
Down
0
Fix
Least Typical
Sick Class
60
62
Male
Female
Asymptom Asymptom
125
160
258
164
False
False
Hyp
Hyp
141
145
True
False
2.8
6.2
Flat
Down
1
3
Rev
Rev
14
A Healthy Class Rule for the Cardiology Patient Dataset

IF 169 <= Maximum Heart Rate <= 202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%

Production rules:
• Rule accuracy is a between-class measure: of the instances the rule covers, the percentage that actually belong to the rule's class.
• Rule coverage is a within-class measure: of the instances in the rule's class, the percentage the rule covers.
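To make the two measures concrete, here is a minimal sketch in Python (hand-rolled helpers of my own naming, not the textbook's software) that scores a rule against a list of instances:

def rule_accuracy(instances, antecedent, target, class_attr="concept_class"):
    # Of the instances the rule covers, what fraction are in the rule's class?
    covered = [x for x in instances if antecedent(x)]
    if not covered:
        return 0.0
    return sum(x[class_attr] == target for x in covered) / len(covered)

def rule_coverage(instances, antecedent, target, class_attr="concept_class"):
    # Of the instances in the rule's class, what fraction does the rule cover?
    in_class = [x for x in instances if x[class_attr] == target]
    if not in_class:
        return 0.0
    return sum(antecedent(x) for x in in_class) / len(in_class)

# The Healthy class rule above, expressed as a predicate over one instance:
healthy_rule = lambda x: 169 <= x["max_heart_rate"] <= 202
# rule_accuracy(patients, healthy_rule, "Healthy")   # ~0.8507 on the full data
# rule_coverage(patients, healthy_rule, "Healthy")   # ~0.3455 on the full data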
A Sick Class Rule for the Cardiology Patient Dataset

IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%
Unsupervised Clustering: Uses
• Determine whether meaningful concepts can be found in the data.
• Evaluate the likely performance of a supervised learner model.
• Determine a best set of input attributes for supervised learning.
• Detect outliers.
Unsupervised Clustering on the Cardiology Patient Data
• Flag Concept Class as unused.
• Examine the clustering output to determine whether the instances from each Concept Class value naturally cluster together.
• If not, repeatedly apply unsupervised clustering with alternative attribute choices.
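A minimal sketch of this procedure, assuming the cardiology data sits in a pandas DataFrame df with the numeric encodings of Table 2.1, and using scikit-learn's KMeans as the clustering algorithm (my choice; the slide does not prescribe one):

import pandas as pd
from sklearn.cluster import KMeans

# Flag Concept Class as unused: cluster on the remaining attributes only.
features = df.drop(columns=["concept_class"])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Examine the output: if each cluster is dominated by one Concept Class
# value, the classes cluster together naturally.
print(pd.crosstab(clusters, df["concept_class"]))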
Market Basket Analysis
• Find interesting relationships among retail products.
• Uses association rule algorithms.
2.2 Supervised Data Mining Techniques
Table 2.3 • The Credit Card Promotion Database
Income
Range ($)
Magazine
Promotion
40–50K
30–40K
40–50K
30–40K
50–60K
20–30K
30–40K
20–30K
30–40K
30–40K
40–50K
20–30K
50–60K
40–50K
20–30K
Yes
Yes
No
Yes
Yes
No
Yes
No
Yes
Yes
No
No
Yes
No
No
Watch
Life Insurance
Promotion
Promotion
No
Yes
No
Yes
No
No
No
Yes
No
Yes
Yes
Yes
Yes
Yes
No
No
Yes
No
Yes
Yes
No
Yes
No
No
Yes
Yes
Yes
Yes
No
Yes
Credit Card
Insurance
Sex
Age
No
No
No
Yes
No
No
Yes
No
No
No
No
No
No
No
Yes
Male
Female
Male
Male
Female
Female
Male
Male
Male
Female
Female
Male
Female
Male
Female
45
40
42
43
38
55
35
27
43
41
43
29
39
55
19
21
A Hypothesis for the Credit Card Promotion Database

A combination of one or more of the dataset attributes differentiates those Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.
Production Rules for the Credit Card Promotion Database

IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%

IF Sex = Male & Income Range = 40–50K
THEN Life Insurance Promotion = No
Rule Accuracy: 100.00%
Rule Coverage: 50.00%
Neural Networks

[Figure: A fully connected feed-forward neural network with an input layer, one hidden layer, and an output layer.]
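To make the picture concrete, here is a minimal sketch of a single forward pass through such a network (arbitrary placeholder weights and sigmoid activations of my choosing, not the textbook's trained network):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A 3-input, 2-hidden-node, 1-output feed-forward network.
W_hidden = np.array([[0.2, -0.1, 0.4],
                     [0.7,  0.1, -0.2]])  # hidden layer weights (2 x 3)
W_output = np.array([[0.9, -0.3]])        # output layer weights (1 x 2)

x = np.array([1.0, 0.0, 0.5])             # one input instance
hidden = sigmoid(W_hidden @ x)            # hidden-layer activations
output = sigmoid(W_output @ hidden)       # network output in (0, 1)
print(output)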
Statistical Regression

Life insurance promotion = 0.5909 (credit card insurance) − 0.5455 (sex) + 0.7727
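A minimal sketch of how this equation behaves, assuming the 1/0 encodings used elsewhere in the chapter (credit card insurance: yes = 1, no = 0; sex: male = 1, female = 0) and a 0.5 cutoff for calling the promotion "Yes":

def life_insurance_score(credit_card_insurance, sex):
    # The regression equation above; inputs are 1/0 encodings.
    return 0.5909 * credit_card_insurance - 0.5455 * sex + 0.7727

for ccins, sex in [(0, 0), (0, 1), (1, 1)]:
    score = life_insurance_score(ccins, sex)
    print(ccins, sex, round(score, 4), "Yes" if score >= 0.5 else "No")

Females without credit card insurance score 0.7727 (Yes), males without it score 0.2272 (No), and males with it score 0.8181 (Yes), consistent with the production rules shown earlier.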
Table 2.4 • Neural Network Training: Actual and Computed Output

ID | Life Insurance Promotion | Computed Output
1  | 0                        | 0.024
2  | 1                        | 0.998
3  | 0                        | 0.023
4  | 1                        | 0.986
5  | 1                        | 0.999
6  | 0                        | 0.050
7  | 1                        | 0.999
8  | 0                        | 0.262
9  | 0                        | 0.060
10 | 1                        | 0.997
11 | 1                        | 0.999
12 | 1                        | 0.776
13 | 1                        | 0.999
14 | 0                        | 0.023
2.3 Association Rules
(Table 2.3, the Credit Card Promotion Database, is shown again at this point; see Section 2.2.)
Association Rules for the Credit Card Promotion Database

IF Sex = Female & Age = over40 & Credit Card Insurance = No
THEN Life Insurance Promotion = Yes

IF Sex = Male & Age = over40 & Credit Card Insurance = No
THEN Life Insurance Promotion = No

IF Sex = Female & Age = over40
THEN Credit Card Insurance = No & Life Insurance Promotion = Yes
Association Rule Advantages
• A rule can have one or several output attributes.
• An output attribute for one rule can be an input attribute for another rule.
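A minimal sketch (hand-rolled helpers, not any particular library's API) of the standard association-rule measures, applied to the third rule above; note the consequent tests two output attributes at once, which a classification rule cannot do:

def support(rows, antecedent, consequent):
    # Fraction of all instances where both rule parts hold.
    return sum(antecedent(r) and consequent(r) for r in rows) / len(rows)

def confidence(rows, antecedent, consequent):
    # Of the instances covered by the antecedent, the fraction where the
    # consequent also holds (the analogue of rule accuracy).
    covered = [r for r in rows if antecedent(r)]
    return sum(consequent(r) for r in covered) / len(covered) if covered else 0.0

# IF Sex = Female & Age = over40
# THEN Credit Card Insurance = No & Life Insurance Promotion = Yes
antecedent = lambda r: r["sex"] == "Female" and r["age"] > 40  # over40, discretized
consequent = lambda r: r["cc_insurance"] == "No" and r["life_promo"] == "Yes"
# support(rows, antecedent, consequent); confidence(rows, antecedent, consequent)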
2.4 Clustering Techniques
(Table 2.3, the Credit Card Promotion Database, is shown again at this point; see Section 2.2.)
Unsupervised Clustering Results for the Credit Card Promotion Database

Cluster 1 (3 instances)
• Sex: Male => 3, Female => 0
• Average age: 43.3
• Credit Card Insurance: Yes => 0, No => 3
• Life Insurance Promotion: Yes => 0, No => 3

Cluster 2 (5 instances)
• Sex: Male => 3, Female => 2
• Average age: 37.0
• Credit Card Insurance: Yes => 1, No => 4
• Life Insurance Promotion: Yes => 2, No => 3

Cluster 3 (7 instances)
• Sex: Male => 2, Female => 5
• Average age: 39.9
• Credit Card Insurance: Yes => 2, No => 5
• Life Insurance Promotion: Yes => 7, No => 0
A Rule for the Third Cluster

IF Sex = Female & 35 <= Age <= 43 & Credit Card Insurance = No
THEN Class = 3
Rule Accuracy: 100%
Rule Coverage: 66.67%
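A minimal sketch of producing a clustering like the one above, assuming Table 2.3 sits in a pandas DataFrame df with numeric encodings; K-means is my choice of algorithm here (the textbook's own clustering techniques come in later chapters):

import pandas as pd
from sklearn.cluster import KMeans

# Assume df encodes Table 2.3 numerically: sex (male = 1, female = 0),
# credit_card_insurance / life_insurance_promotion (yes = 1, no = 0), age.
features = df[["sex", "age", "credit_card_insurance"]]
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Summarize each cluster the way the slide does.
for label, group in df.groupby("cluster"):
    print(label, len(group), round(group["age"].mean(), 1),
          group["life_insurance_promotion"].value_counts().to_dict())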
2.5 Evaluating Performance
Evaluating Supervised Learner Models
General Questions
• Will the benefits received from a data mining project more than offset the cost of the data mining process?
  ◦ Answering this requires knowledge of the business model.
• How do we interpret the results of a data mining session?
• Can we use the results of a data mining process with confidence?
Confusion Matrix
• A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• Entries other than those on the main diagonal are classification errors.
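A minimal hand-rolled sketch of building such a matrix from actual and computed labels (scikit-learn's sklearn.metrics.confusion_matrix does the same job):

from collections import Counter

def confusion_matrix(actual, computed, classes):
    # counts[(i, j)] = number of class-i instances computed to be class j;
    # the main diagonal therefore holds the correct classifications.
    counts = Counter(zip(actual, computed))
    return [[counts[(i, j)] for j in classes] for i in classes]

actual   = ["C1", "C1", "C2", "C3", "C2", "C1"]
computed = ["C1", "C2", "C2", "C3", "C2", "C1"]
print(confusion_matrix(actual, computed, ["C1", "C2", "C3"]))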
Table 2.5 • A Three-Class Confusion Matrix

       Computed Decision
       C1     C2     C3
C1     C11    C12    C13
C2     C21    C22    C23
C3     C31    C32    C33

(Cij is the number of class Ci instances computed to be members of class Cj.)
Two-Class Error Analysis
Table 2.6 • A Simple Confusion Matrix

         Computed Accept   Computed Reject
Accept   True Accept       False Reject
Reject   False Accept      True Reject
Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate

Model A   Computed Accept   Computed Reject
Accept    600               25
Reject    75                300

Model B   Computed Accept   Computed Reject
Accept    600               75
Reject    25                300

• Which is better?
• Given that credit card purchases are unsecured, choose Model B: it makes fewer false accepts (25 versus 75), and a false accept costs more than a false reject.
• As the saying goes, better to go without than to settle for something substandard.
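A minimal sketch of the comparison, with the cost weights chosen purely for illustration (the slide gives no actual dollar figures):

def weighted_cost(matrix, false_accept_cost, false_reject_cost):
    # matrix = [[true accepts, false rejects], [false accepts, true rejects]]
    return (matrix[1][0] * false_accept_cost
            + matrix[0][1] * false_reject_cost)

model_a = [[600, 25], [75, 300]]
model_b = [[600, 75], [25, 300]]
# Same 10% error rate, but if a false accept costs five times a false reject:
print(weighted_cost(model_a, 5, 1))   # 400
print(weighted_cost(model_b, 5, 1))   # 200 -> Model B is cheaper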
Evaluating Numeric Output
When the output attribute is numeric, common error measures include:
• Mean absolute error
• Mean squared error
• Root mean squared error (comparable in form to a standard deviation of the errors)
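A minimal sketch of the three measures, applied to the first few actual/computed pairs from Table 2.4:

import math

def error_measures(actual, computed):
    diffs = [a - c for a, c in zip(actual, computed)]
    mae = sum(abs(d) for d in diffs) / len(diffs)   # mean absolute error
    mse = sum(d * d for d in diffs) / len(diffs)    # mean squared error
    rms = math.sqrt(mse)                            # root mean squared error
    return mae, mse, rms

actual   = [0, 1, 0, 1]                  # Table 2.4, instances 1-4
computed = [0.024, 0.998, 0.023, 0.986]
print(error_measures(actual, computed))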
Comparing Models by
Measuring Lift
[Figure: A lift chart. The x-axis shows the percentage of the population sampled (0–100%); the y-axis shows the number responding (0–1,200).]
Computing Lift

Lift = P(Ci | Sample) / P(Ci | Population)
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model

No Model   Computed Accept   Computed Reject
Accept     1,000             0
Reject     99,000            0

Ideal Model   Computed Accept   Computed Reject
Accept        1,000             0
Reject        0                 99,000
Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

Model X   Computed Accept   Computed Reject
Accept    540               460
Reject    23,460            75,540

Model Y   Computed Accept   Computed Reject
Accept    450               550
Reject    19,550            79,450

• Which is better?
• Both models have the same lift: 540/24,000 = 450/20,000 = 2.25%, which is 2.25 times the population's 1% response rate.
• The trade-off: Model Y sends 4,000 fewer mailings, but Model X makes 90 more sales.
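A minimal sketch of the lift computation for the two matrices above, using the population figures from Table 2.8 (1,000 responders in 100,000 people):

def lift(matrix, pop_responders=1_000, pop_size=100_000):
    # matrix = [[accept/accept, accept/reject], [reject/accept, reject/reject]]
    sample_size = matrix[0][0] + matrix[1][0]   # everyone computed "Accept"
    p_sample = matrix[0][0] / sample_size       # P(responder | sample)
    p_population = pop_responders / pop_size    # P(responder | population)
    return p_sample / p_population

model_x = [[540, 460], [23_460, 75_540]]
model_y = [[450, 550], [19_550, 79_450]]
print(lift(model_x), lift(model_y))             # 2.25 for both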
Unsupervised Model Evaluation
• Perform an unsupervised clustering and assign each cluster an arbitrary name (e.g., C1, C2, C3).
• Choose a random sample of instances from each of the classes formed as a result of the instance clustering.
• Build a supervised learner model using the cluster names as class labels, and test it on the remaining instances.
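A minimal sketch of this evaluation loop, assuming a numeric feature DataFrame df and scikit-learn (my choice of tools; the slide names no specific algorithms):

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Cluster, then treat the arbitrary cluster names as class labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)

# Sample from each cluster, train a supervised model on the sample, and
# check how well the clustering can be re-learned from the attributes.
X_train, X_test, y_train, y_test = train_test_split(
    df, labels, test_size=0.3, stratify=labels, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test))  # high accuracy suggests well-defined clusters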
Chapter 3
Basic Data Mining Techniques