Chapter 9
DATA MINING
PAULA JENSEN
SDSM&T ENGM 745
McGraw-Hill/Irwin
Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.

DATA MINING
  DATA-DATA
  Extracting useful information from large databases
  Tools of data mining
  Looking at where to find the data

TOOLS OF DATA MINING
  Prediction
  Classification
  Clustering
  Association

PREDICTION
  Predict the value of a numeric variable
  Customer's expenditure
  Will they purchase?
  What are their interests?
  Do their interests predict a purchase?

CLASSIFICATION
  Classes of objects or actions
  Reliability of customer
  Income
  Location

CLUSTERING
  Analysis tools analyze objects viewed as a class
  Where is the cutoff for income or size?
  How do I group the information?

ASSOCIATION
  Patterns based on likes
  Netflix
  Facebook
  Google

CLASSIFICATION
  k-nearest neighbor
  Naïve Bayes
  Classification/regression trees
  Logistic regression

DATA MINING TERMINOLOGY

K-NEAREST NEIGHBOR
  Uses a subset of the total data, called training data
  Select the closest neighbors using the Euclidean distance shown on the previous slide; other distance metrics are available to define neighbors
  Validation data is a separate set of data
  The test statistic on the validation data is more important than the one on the training data
  60% training data and 40% validation data is an acceptable mix

K-NEAREST NEIGHBOR ANALYSIS
  Multidimensional
  The program computes a distance associated with each attribute
  Continuous variables are measured on different scales
  Categorical attributes use a weighting mechanism
  Example: will they respond to marketing to take a loan?
  k = 3 means each record is classified using its 3 nearest neighbors
  Type 1 would take a loan; Type 0 would not take a loan

TERMS
  Lift: measures the change in concentration of a particular class when the model is used to select a group from the general population. The example shows significant lift.
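The lift definition above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's software: the function name `lift_at_top`, the scores, and the labels are all hypothetical, and class 1 is assumed to mean "would take a loan" as in the k-nearest neighbor example.

```python
def lift_at_top(scores, labels, fraction=0.10):
    """Lift of the top `fraction` of records ranked by model score.

    Lift = (share of class-1 records in the selected group)
         / (share of class-1 records in the whole population).
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal length")
    n_selected = max(1, round(len(scores) * fraction))
    # Rank records from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    selected = [label for _, label in ranked[:n_selected]]
    rate_selected = sum(selected) / n_selected
    rate_overall = sum(labels) / len(labels)
    if rate_overall == 0:
        raise ValueError("no class-1 records in the population")
    return rate_selected / rate_overall


# Hypothetical data: 20 records, 4 of which are class 1 (would take the loan).
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.40,
          0.35, 0.30, 0.25, 0.20, 0.18, 0.15, 0.12, 0.10, 0.05, 0.02]
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

top_decile_lift = lift_at_top(scores, labels, fraction=0.10)
print(f"Lift in the top 10%: {top_decile_lift:.1f}")
```

With these made-up numbers, both records in the top 10% are class 1, while only 20% of the whole population is, so the selected group is five times as concentrated in class 1 as the population.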
  Decile-wise chart: if we pick the top 10% of our records as ranked by our model, our selection would include approximately 7 times as many correct classifications as a random selection of the same size.

CLASSIFICATION TREES
  Advantages
    Decision rules are easy
    Easy to understand
  Disadvantages
    Can overfit the data
    Correlated attributes will cause multicollinearity

NAÏVE BAYES
  Statistical classification
  Bayes' theorem: predicts the probability of a prior event given that a certain subsequent event has taken place
  Called naïve because each attribute is assumed to be independent

BAYES' THEOREM
  P(A|B) = P(B|A) * P(A) / P(B)
  P(A) is the prior probability of A
  P(A|B) is the conditional probability of A given B
  P(B|A) is the conditional probability of B given A
  P(B) is the prior probability of B

APPLYING BAYES' THEOREM

REGRESSION
  Logistic regression, or logit analysis
  The difference between logistic regression and ordinary regression is that the dependent variable in logistic regression is categorical, not continuous
  The dependent variable is dichotomous: either yes or no
  The predicted value of the dependent variable is limited to values between 0 and 1

WHERE DO I FIND THE DATA???
  Current customer activity
  Collect in your database
  Family names
  Sales software
  Forms from your website
  Wufoo.com
  Track inquiries
  Current Facebook activity

BUY IT!
  Mailing lists

How to use it???
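Returning to the Bayes' theorem slide earlier in this chapter, the formula can be checked numerically with a short Python sketch. The function name `posterior` and all probabilities below are hypothetical, chosen only for illustration. P(B) is not given directly here, so the sketch assumes it is obtained from the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B).

    P(B) is expanded with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * (1 - P(A)).
    """
    evidence_b = (likelihood_b_given_a * prior_a
                  + likelihood_b_given_not_a * (1.0 - prior_a))
    if evidence_b == 0:
        raise ValueError("P(B) is zero; P(A|B) is undefined")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical numbers:
#   A = customer takes a loan,           P(A)       = 0.10
#   B = customer has an online account,  P(B|A)     = 0.60
#                                        P(B|not A) = 0.20
p_loan_given_online = posterior(0.10, 0.60, 0.20)
print(f"P(A|B) = {p_loan_given_online:.2f}")
```

Under these assumed values, P(B) = 0.60 * 0.10 + 0.20 * 0.90 = 0.24, so P(A|B) = 0.06 / 0.24 = 0.25: observing B raises the probability of A from 10% to 25%.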