Chapter 2 Data Mining: A Closer Look

2.1 Data Mining Strategies

Data mining strategies:
• Unsupervised clustering
• Supervised learning: classification, estimation, and prediction
• Market basket analysis

Supervised Learning
• Output attributes: dependent variables
• Input attributes: independent variables

Classification: Characteristics
• Learning is supervised.
• The dependent variable is categorical.
• Classes are well defined.
• The emphasis is on current rather than future behavior.

Classification: Examples
• Determine whether an individual has suffered a heart attack.
• Develop a profile of a successful person.
• Determine whether a credit card purchase is fraudulent.
• Determine whether a car loan applicant is a good credit risk.

Estimation: Characteristics
• Learning is supervised.
• The dependent variable is numeric.
• Classes are well defined.
• The emphasis is on current rather than future behavior.

Estimation: Examples
• The future location of a thunderstorm
• The salary of a sports car owner
• The likelihood that a credit card has been stolen
• An estimation problem can be transformed into a classification problem by thresholding an estimated probability.

Prediction: Characteristics
• The emphasis is on predicting future rather than current outcomes.
• The output attribute may be categorical or numeric.

Prediction: Examples
• The total number of touchdowns an NFL running back will score
• Whether a card holder will take advantage of a special offer made available with their credit card billing
• A future stock price
• Which telephone subscribers are likely to change providers

The Cardiology Patient Dataset

Table 2.1 • Cardiology Patient Data

Attribute Name              Mixed Values                                    Numeric Values   Comments
Age                         Numeric                                         Numeric          Age in years
Sex                         Male, Female                                    1, 0             Patient gender
Chest Pain Type             Angina, Abnormal Angina, NoTang, Asymptomatic   1–4              NoTang = nonanginal pain
Blood Pressure              Numeric                                         Numeric          Resting blood pressure upon hospital admission
Cholesterol                 Numeric                                         Numeric          Serum cholesterol
Fasting Blood Sugar < 120   True, False                                     1, 0             Is fasting blood sugar less than 120?
Resting ECG                 Normal, Abnormal, Hyp                           0, 1, 2          Hyp = left ventricular hypertrophy
Maximum Heart Rate          Numeric                                         Numeric          Maximum heart rate achieved
Induced Angina?             True, False                                     1, 0             Does the patient experience angina as a result of exercise?
Old Peak                    Numeric                                         Numeric          ST depression induced by exercise relative to rest
Slope                       Up, Flat, Down                                  1–3              Slope of the peak exercise ST segment
Number of Colored Vessels   0, 1, 2, 3                                      0, 1, 2, 3       Number of major vessels colored by fluoroscopy
Thal                        Normal, Fix, Rev                                3, 6, 7          Normal, fixed defect, reversible defect
Concept Class               Healthy, Sick                                   1, 0             Angiographic disease status

Table 2.2 • Most and Least Typical Instances from the Cardiology Domain

                            Most Typical     Least Typical    Most Typical   Least Typical
Attribute Name              Healthy Class    Healthy Class    Sick Class     Sick Class
Age                         52               63               60             62
Sex                         Male             Male             Male           Female
Chest Pain Type             NoTang           Angina           Asymptomatic   Asymptomatic
Blood Pressure              138              145              125            160
Cholesterol                 223              233              258            164
Fasting Blood Sugar < 120   False            True             False          False
Resting ECG                 Normal           Hyp              Hyp            Hyp
Maximum Heart Rate          169              150              141            145
Induced Angina?             False            False            True           False
Old Peak                    0                2.3              2.8            6.2
Slope                       Up               Down             Flat           Down
Number of Colored Vessels   0                0                1              3
Thal                        Normal           Fix              Rev            Rev

A Healthy Class Rule for the Cardiology Patient Dataset

IF 169 <= Maximum Heart Rate <= 202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%

Production rules:
• Rule accuracy is a between-class measure.
• Rule coverage is a within-class measure.

A Sick Class Rule for the Cardiology Patient Dataset

IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%

Unsupervised Clustering
• Determine if concepts can be found in the data.
• Evaluate the likely performance of a supervised model.
• Determine a best set of input attributes for supervised learning.
• Detect outliers.

Unsupervised Clustering on the Cardiology Patient Data
• Flag Concept Class as unused.
• Examine the clustering output to determine whether the instances from each Concept Class naturally cluster together.
• If not, repeatedly apply unsupervised clustering with alternative attribute choices.

Market Basket Analysis
• Find interesting relationships among retail products.
• Uses association rule algorithms.

2.2 Supervised Data Mining Techniques

Table 2.3 • The Credit Card Promotion Database

Income      Magazine    Watch       Life Insurance   Credit Card
Range ($)   Promotion   Promotion   Promotion        Insurance     Sex      Age
40–50K      Yes         No          No               No            Male     45
30–40K      Yes         Yes         Yes              No            Female   40
40–50K      No          No          No               No            Male     42
30–40K      Yes         Yes         Yes              Yes           Male     43
50–60K      Yes         No          Yes              No            Female   38
20–30K      No          No          No               No            Female   55
30–40K      Yes         No          Yes              Yes           Male     35
20–30K      No          Yes         No               No            Male     27
30–40K      Yes         No          No               No            Male     43
30–40K      Yes         Yes         Yes              No            Female   41
40–50K      No          Yes         Yes              No            Female   43
20–30K      No          Yes         Yes              No            Male     29
50–60K      Yes         Yes         Yes              No            Female   39
40–50K      No          Yes         No               No            Male     55
20–30K      No          No          Yes              Yes           Female   19

A Hypothesis for the Credit Card Promotion Database

A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.
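Rule accuracy (a between-class measure) and rule coverage (a within-class measure) for production rules like those above can be computed directly from a dataset. The sketch below assumes a small list-of-dicts sample in the spirit of the credit card promotion data; the attribute names, helper function, and rows are illustrative, not the full Table 2.3.

```python
# Sketch: computing rule accuracy and coverage for a production rule.
# The rows and attribute names below are illustrative stand-ins for the
# credit card promotion data used throughout this chapter.

def rule_stats(rows, antecedent, cls_attr, cls_value):
    """antecedent is a predicate over a row; returns (accuracy, coverage)."""
    covered = [r for r in rows if antecedent(r)]                # rows matching the IF part
    correct = [r for r in covered if r[cls_attr] == cls_value]  # ...that are in the rule's class
    in_class = [r for r in rows if r[cls_attr] == cls_value]    # all rows of the class
    accuracy = len(correct) / len(covered)    # between-class: how often the rule is right
    coverage = len(correct) / len(in_class)   # within-class: how much of the class it captures
    return accuracy, coverage

rows = [
    {"sex": "Female", "age": 40, "life_ins": "Yes"},
    {"sex": "Female", "age": 38, "life_ins": "Yes"},
    {"sex": "Male",   "age": 45, "life_ins": "No"},
    {"sex": "Female", "age": 55, "life_ins": "No"},
    {"sex": "Male",   "age": 43, "life_ins": "Yes"},
]

# IF Sex = Female & 19 <= Age <= 43 THEN Life Insurance Promotion = Yes
acc, cov = rule_stats(
    rows,
    lambda r: r["sex"] == "Female" and 19 <= r["age"] <= 43,
    "life_ins", "Yes",
)
print(acc, cov)  # accuracy 1.0; the rule covers 2 of the 3 "Yes" instances
```

On this toy sample the rule is always right when it fires (accuracy 1.0) but captures only two thirds of the class (coverage 2/3), illustrating why both measures are reported for each rule.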
Production Rules for the Credit Card Promotion Database

IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule accuracy: 100.00%
Rule coverage: 66.67%

IF Sex = Male & Income Range = 40–50K
THEN Life Insurance Promotion = No
Rule accuracy: 100.00%
Rule coverage: 50.00%

Neural Networks

Figure: a feed-forward neural network with an input layer, a hidden layer, and an output layer.

Statistical Regression

Life insurance promotion = 0.5909 (credit card insurance) − 0.5455 (sex) + 0.7727

Table 2.4 • Neural Network Training: Actual and Computed Output

ID   Life Insurance Promotion   Computed Output
1    0                          0.024
2    1                          0.998
3    0                          0.023
4    1                          0.986
5    1                          0.999
6    0                          0.050
7    1                          0.999
8    0                          0.262
9    0                          0.060
10   1                          0.997
11   1                          0.999
12   1                          0.776
13   1                          0.999
14   0                          0.023

2.3 Association Rules

Table 2.3 • The Credit Card Promotion Database (repeated; see Section 2.2)

Association Rules for the Credit Card Promotion Database

IF Sex = Female & Age = over40 & Credit Card Insurance = No
THEN Life Insurance Promotion = Yes

IF Sex = Male & Age = over40 & Credit Card Insurance = No
THEN Life Insurance Promotion = No

IF Sex = Female & Age = over40
THEN Credit Card Insurance = No & Life Insurance Promotion = Yes

Association rule advantages:
• A rule may have one or several output attributes.
• An output attribute for one rule can be an input attribute for another rule.

2.4 Clustering Techniques

Table 2.3 • The Credit Card Promotion Database (repeated; see Section 2.2)

Cluster 1 (3 instances)
• Sex: Male => 3, Female => 0
• Age: 43.3
• Credit Card Insurance: Yes => 0, No => 3
• Life Insurance Promotion: Yes => 0, No => 3

Cluster 2 (5 instances)
• Sex: Male => 3, Female => 2
• Age: 37.0
• Credit Card Insurance: Yes => 1, No => 4
• Life Insurance Promotion: Yes => 2, No => 3

Cluster 3 (7 instances)
• Sex: Male => 2, Female => 5
• Age: 39.9
• Credit Card Insurance: Yes => 2, No => 5
• Life Insurance Promotion: Yes => 7, No => 0

A Rule for the Third Cluster

IF Sex = Female & 35 <= Age <= 43 & Credit Card Insurance = No
THEN Class = 3
Rule accuracy: 100%
Rule coverage: 66.67%

2.5 Evaluating Performance

Evaluating Supervised Learner Models

General questions:
• Will the benefits received from a data mining project more than offset the cost of the data mining process?
  ◦ Answering this requires knowledge of the business model.
• How do we interpret the results of a data mining session?
• Can we use the results of a data mining process with confidence?

Confusion Matrix
• A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• Entries other than those on the main diagonal are classification errors.
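A confusion matrix of this kind can be summarized as an error rate in a few lines of code. The sketch below uses the Model A counts from Table 2.7; the dictionary representation is an illustrative choice, not a prescribed format.

```python
# Sketch: summarizing a two-class confusion matrix.
# Keys are (actual, computed) class pairs; counts are Model A of Table 2.7.

matrix = {
    ("Accept", "Accept"): 600,  # true accepts  (main diagonal)
    ("Accept", "Reject"): 25,   # false rejects
    ("Reject", "Accept"): 75,   # false accepts
    ("Reject", "Reject"): 300,  # true rejects  (main diagonal)
}

total = sum(matrix.values())
# Correct classifications lie on the main diagonal (actual == computed).
correct = sum(n for (actual, computed), n in matrix.items() if actual == computed)
error_rate = 1 - correct / total

print(f"error rate = {error_rate:.0%}")  # prints "error rate = 10%"
```

Here 900 of 1,000 instances fall on the main diagonal, giving the 10% error rate claimed for both models in Table 2.7; the two models differ only in how the 100 errors split between false accepts and false rejects.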
Table 2.5 • A Three-Class Confusion Matrix

        Computed Decision
        C1    C2    C3
C1      C11   C12   C13
C2      C21   C22   C23
C3      C31   C32   C33

Two-Class Error Analysis

Table 2.6 • A Simple Confusion Matrix

          Computed Accept   Computed Reject
Accept    True Accept       False Reject
Reject    False Accept      True Reject

Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate

Model A   Computed Accept   Computed Reject
Accept    600               25
Reject    75                300

Model B   Computed Accept   Computed Reject
Accept    600               75
Reject    25                300

• Which model is better?
• Given that credit card purchases are unsecured, choose Model B, which makes fewer false accepts: better to turn away some good customers than to accept bad ones.

Evaluating Numeric Output

When the output attribute is numeric:
• Mean absolute error
• Mean squared error
• Root mean squared error (comparable to a standard deviation of the errors)

Comparing Models by Measuring Lift

Figure: a lift chart plotting Number Responding (0–1,200) against % Sampled (0–100).

Computing Lift

Lift = P(Ci | Sample) / P(Ci | Population)

Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model

No Model   Computed Accept   Computed Reject
Accept     1,000             0
Reject     99,000            0

Ideal Model   Computed Accept   Computed Reject
Accept        1,000             0
Reject        0                 99,000

Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

Model X   Computed Accept   Computed Reject
Accept    540               460
Reject    23,460            75,540

Model Y   Computed Accept   Computed Reject
Accept    450               550
Reject    19,550            79,450

• Which model is better?
• Both have the same lift: 540/24,000 = 450/20,000 = 0.0225.
• Model Y sends 4,000 fewer mailings at the cost of 90 fewer sales.

Unsupervised Model Evaluation
• Perform an unsupervised clustering and assign each cluster an arbitrary name, e.g. C1, C2, and C3.
• Choose a random sample of instances from each of the classes formed as a result of the clustering.
• Build a supervised learner model using the cluster names as class labels.

Chapter 3 Basic Data Mining Techniques
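As a closing check on Section 2.5's lift measure, the sketch below reproduces the lift of 2.25 for Models X and Y of Table 2.9, using the population response rate from Table 2.8 (1,000 responders among 100,000 people). The function name is illustrative.

```python
# Sketch: computing lift for the mailing models of Table 2.9.
# Lift = P(accept | sample) / P(accept | population).

def lift(sample_accepts, sample_size, pop_accepts, pop_size):
    p_sample = sample_accepts / sample_size   # P(accept | sample)
    p_population = pop_accepts / pop_size     # P(accept | population), 1% here
    return p_sample / p_population

# Model X mails 24,000 people (540 + 23,460) and reaches 540 responders.
print(round(lift(540, 24_000, 1_000, 100_000), 2))  # prints 2.25
# Model Y mails 20,000 people and reaches 450 responders: the same lift.
print(round(lift(450, 20_000, 1_000, 100_000), 2))  # prints 2.25
```

Because both models share a lift of 2.25, choosing between them comes down to economics: Model Y saves 4,000 mailings but gives up 90 sales relative to Model X.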