Chapter 2: Overview of the Data Mining Process

Introduction
• Data mining – predictive analysis
  – Tasks of classification and prediction
  – The core of business intelligence
• Database methods – OLAP, SQL
  – Do not involve statistical modeling

Core Ideas in Data Mining
• Analytical methods used in predictive analytics
  – Classification
    • Used with categorical response variables
    • E.g., will a purchase be made / not made?
  – Prediction
    • Predict (estimate) the value of a continuous response variable
    • The term "prediction" is used with categorical responses as well
  – Association rules
    • Affinity analysis – "what goes with what"
    • Seeks correlations among data

Core Ideas in Data Mining
• Data reduction
  – Reduce the number of variables
  – Group together similar variables
• Data exploration
  – View data as evidence
  – Get "a feel" for the data
• Data visualization
  – Graphical representation of data
  – Locate trends, correlations, etc.

Supervised Learning
• "Supervised learning" algorithms are those used in classification and prediction.
  – Data are available in which the value of the outcome of interest is known.
• "Training data" are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable.
• This process results in a "model"
  – Classification model
  – Predictive model

Supervised Learning
• The model is then run with another sample of data – "validation data" – in which the outcome is known, to see how well the model performs.
  – If many different models are being tried out, a third sample of known outcomes – "test data" – is used with the final, selected model to predict how well it will do.
• The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown.

Supervised Learning
• Linear regression analysis is an example of supervised learning.
  – The Y variable is the (known) outcome variable.
  – The X variable is a predictor variable.
  – A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by the line.
  – The regression line can then be used to predict Y values for new values of X for which we do not know the Y value.

Unsupervised Learning
• No outcome variable to predict or classify
• No "learning" from cases where the outcome is known
• Unsupervised learning methods
  – Association rules
  – Data reduction methods
  – Clustering techniques

The Steps in Data Mining
1. Develop an understanding of the purpose of the data mining project.
   – Is it a one-shot effort to answer a question (or questions), or an ongoing application?
2. Obtain the dataset to be used in the analysis.
   – Random sampling from a large database to capture the records to be used in the analysis
   – Pulling together data from different databases
     • Internal (e.g., past purchases made by customers)
     • External (e.g., credit ratings)
   – Usually the analysis requires only thousands or tens of thousands of records.

The Steps in Data Mining
3. Explore, clean, and preprocess the data.
   – Verify that the data are in reasonable condition.
   – How should missing data be handled?
   – Are the values in a reasonable range, given what you would expect for each variable?
   – Are there obvious "outliers"?
   – Review the data graphically – for example, with a matrix of scatter plots showing the relationship of each variable with each other variable (see the sketch below).
   – Ensure consistency in the definitions of fields, units of measurement, time periods, etc.
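As a concrete illustration of step 3, a minimal exploration pass might look like this in Python/pandas. The slides use XLMiner; this sketch is only an illustrative equivalent, and "customers.csv" is a hypothetical file name:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")   # hypothetical dataset

print(df.describe())     # are the values in a reasonable range for each variable?
print(df.isna().sum())   # how many missing values does each variable have?

# Matrix of scatter plots: each variable plotted against each other variable
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()
```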
The Steps in Data Mining
4. Reduce the data.
   – If supervised learning is involved, separate the data into training, validation, and test datasets.
   – Eliminate unneeded variables.
   – Transform variables
     • e.g., turning "money spent" into "spent > $100" vs. "spent ≤ $100"
   – Create new variables
     • e.g., a variable that records whether at least one of several products was purchased
   – Make sure you know what each variable means, and whether it is sensible to include it in the model.
5. Determine the data mining task.
   – Classification, prediction, clustering, etc.
6. Choose the data mining techniques to be used.
   – Regression, neural nets, hierarchical clustering, etc.

The Steps in Data Mining
7. Use algorithms to perform the task.
   – This is an iterative process – trying multiple variants, and often multiple variants of the same algorithm (choosing different variables or settings within the algorithm).
   – When appropriate, feedback from the algorithm's performance on validation data is used to refine the settings.
8. Interpret the results of the algorithms.
   – Choose the best algorithm to deploy.
   – Use the final choice on the test data to get an idea of how well it will perform.
9. Deploy the model.
   – Integrate the model into operational systems.
   – Run it on real records to produce decisions or actions.
   – For example, the model might be applied to a purchased list of possible customers, and the action might be "include in the mailing if the predicted amount of purchase is > $10."

Preliminary Steps
• Organization of datasets
  – Records in rows
  – Variables in columns
  – In supervised learning, one of the columns is the outcome variable, usually placed first or last.
• Sampling from a database
  – Use samples to create, validate, and test the model.
• Oversampling rare events
  – If one value of the response variable is seldom found in the data, increase its share of the sample (oversample it) so the model sees enough such cases.
  – Adjust the algorithm or its results as necessary to compensate.

Preliminary Steps (Pre-processing and Cleaning the Data)
• Types of variables
  – Continuous – assumes any real numerical value (generally within a specified range)
  – Categorical – assumes one of a limited number of values
    • Text (e.g., payments ∈ {current, not current, bankrupt})
    • Numerical (e.g., age ∈ {0, …, 120})
    • Nominal – unordered categories (e.g., payment status)
    • Ordinal – ordered categories (e.g., age)

Preliminary Steps (Pre-processing and Cleaning the Data)
• Handling categorical variables
  – If a categorical variable is ordered (ordinal), it can be used as a continuous variable (e.g., age, level of credit).
  – Use "dummy" variables when the range of values is not large (see the sketch after this list).
    • e.g., occupation ∈ {student, unemployed, employed, retired}
    • Create a binary (yes/no) dummy variable for each category:
      – Student – yes/no
      – Unemployed – yes/no
      – Employed – yes/no
      – Retired – yes/no
• Variable selection
  – The more predictor variables there are, the more records are needed to build the model.
  – Reduce the number of variables whenever appropriate.
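A minimal sketch of the dummy-variable encoding described above, using pandas; the occupation categories are the ones from the example, and the data rows are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"occupation":
    ["student", "employed", "retired", "unemployed", "employed"]})

# One binary (0/1) column per category: occupation_student,
# occupation_unemployed, occupation_employed, occupation_retired
dummies = pd.get_dummies(df["occupation"], prefix="occupation", dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
```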
Preliminary Steps (Pre-processing and Cleaning the Data)
• Overfitting
  – In building a model we describe relationships among variables in order to predict future outcome (dependent) values on the basis of future predictor (independent) values.
  – Avoid "explaining" variation in the data that is nothing more than chance variation.
  – Avoid mislabeling "noise" in the data as if it were "signal."
  – Caution: if the dataset is not much larger than the number of predictor variables, it is very likely that such a spurious relationship will creep into the model.

Overfitting
[Figure: illustration of an overfitted model.]

Preliminary Steps (Pre-processing and Cleaning the Data)
• How many variables and how much data?
  – A good rule of thumb is to have ten records for every predictor variable.
  – For classification procedures: at least 6 × m × p records, where m = number of outcome classes and p = number of variables.
  – Compactness or parsimony is a desirable feature in a model.
  – A matrix of x-y plots can be useful in variable selection: it shows at a glance the x-y plots for all variable combinations.
    • A straight line would be an indication that one variable is exactly correlated with another; we would then want to include only one of them in the model.
  – Weed out irrelevant and redundant variables from the model.
  – Consult a domain expert whenever possible.

Preliminary Steps (Pre-processing and Cleaning the Data)
• Outliers
  – Values that lie far away from the bulk of the data are called outliers.
  – No statistical rule can tell us whether such an outlier is the result of an error; these are judgments best made by someone with "domain" knowledge.
  – If the number of records with outliers is very small, they might be treated as missing data.

Preliminary Steps (Pre-processing and Cleaning the Data)
• Missing values
  – If the number of records with missing values is small, those records might be omitted.
  – The more variables there are, the more records are likely to be dropped.
  – One solution is to substitute the average value, computed from records with valid data, for a variable's missing values.
    • This reduces the variability in the data set.
  – Human judgment can be used to determine the best way to handle missing data.

Preliminary Steps (Pre-processing and Cleaning the Data)
• Normalizing (standardizing) the data
  – To normalize the data, we subtract the mean from each value and divide by the standard deviation of the resulting deviations from the mean.
    • This expresses each value as the "number of standard deviations away from the mean" – the z-score.
  – Needed if variables are in different units, e.g., hours, thousands of dollars, etc.
  – Clustering algorithms measure distances between variable values and therefore need a standard scale for distance.
  – Data mining software, including XLMiner, typically has an option that normalizes the data in those algorithms where it may be required.

Preliminary Steps
• Use and creation of partitions (sketched in code below)
  – Training partition
    • The largest partition
    • Contains the data used to build the various models
    • The same training partition is generally used to develop multiple models.
  – Validation partition
    • Used to assess the performance of each model
    • Used to compare models and pick the best one
    • In classification and regression tree algorithms, the validation partition may be used automatically to tune and improve the model.
  – Test partition
    • Sometimes called the "holdout" or "evaluation" partition; used to assess the performance of the chosen model on new data.

The Three Data Partitions and Their Role in the Data Mining Process
[Figure: flow of data through the training, validation, and test partitions.]
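Two of the preliminary steps above, z-score normalization and partitioning, might be sketched as follows. The 60/20/20 split proportions and the file name "data.csv" are illustrative assumptions, not fixed rules:

```python
import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical dataset
num_cols = df.select_dtypes("number").columns

# z-score: subtract the mean, divide by the standard deviation
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# Shuffle the records, then cut into training / validation / test partitions
shuffled = df.sample(frac=1, random_state=1)
n = len(shuffled)
train = shuffled.iloc[:int(0.6 * n)]              # largest partition
valid = shuffled.iloc[int(0.6 * n):int(0.8 * n)]  # model comparison
test  = shuffled.iloc[int(0.8 * n):]              # final "holdout" assessment
```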
An Example – Linear Regression: Boston Housing Data

First ten records:

CRIM   ZN    INDUS  CHAS  NOX   RM    AGE   DIS   RAD  TAX  PTRATIO  B    LSTAT  MEDV  CAT.MEDV
0.006  18    2.31   0     0.54  6.58  65.2  4.09  1    296  15.3     397  5      24    0
0.027  0     7.07   0     0.47  6.42  78.9  4.97  2    242  17.8     397  9      21.6  0
0.027  0     7.07   0     0.47  7.19  61.1  4.97  2    242  17.8     393  4      34.7  1
0.032  0     2.18   0     0.46  7.00  45.8  6.06  3    222  18.7     395  3      33.4  1
0.069  0     2.18   0     0.46  7.15  54.2  6.06  3    222  18.7     397  5      36.2  1
0.030  0     2.18   0     0.46  6.43  58.7  6.06  3    222  18.7     394  5      28.7  0
0.088  12.5  7.87   0     0.52  6.01  66.6  5.56  5    311  15.2     396  12     22.9  0
0.145  12.5  7.87   0     0.52  6.17  96.1  5.95  5    311  15.2     397  19     27.1  0
0.211  12.5  7.87   0     0.52  5.63  100   6.08  5    311  15.2     387  30     16.5  0
0.170  12.5  7.87   0     0.52  6.00  85.9  6.59  5    311  15.2     387  17     18.9  0

Variable descriptions:
CRIM      per capita crime rate by town
ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     proportion of non-retail business acres per town
CHAS      Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX       nitric oxides concentration (parts per 10 million)
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built prior to 1940
DIS       weighted distances to five Boston employment centres
RAD       index of accessibility to radial highways
TAX       full-value property-tax rate per $10,000
PTRATIO   pupil-teacher ratio by town
B         1000(Bk − 0.63)², where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      median value of owner-occupied homes in $1000s
CAT.MEDV  binary version of MEDV (1 if MEDV > $30,000; 0 otherwise)

Partitioning the Data
[Screenshot: XLMiner dialog for partitioning the data.]

Using XLMiner for Multiple Linear Regression
[Screenshot: XLMiner multiple linear regression dialog.]

Specifying Output
[Screenshot: XLMiner output options.]

Prediction of Training Data
Row ID  Predicted Value  Actual Value  Residual
1       30.24690555      24            -6.246905549
4       28.61652272      33.4           4.783477282
5       27.76434086      36.2           8.435659135
6       25.6204032       28.7           3.079596801
9       11.54583087      16.5           4.954169128
10      19.13566187      18.9          -0.235661871
12      21.95655773      18.9          -3.05655773
17      20.80054199      23.1           2.299458015
18      16.94685562      17.5           0.553144385

Prediction of Validation Data
Row ID  Predicted Value  Actual Value  Residual
2       25.03555247      21.6          -3.435552468
3       30.1845219       34.7           4.515478101
7       23.39322259      22.9          -0.493222593
8       19.58824389      27.1           7.511756109
11      18.83048747      15            -3.830487466
13      21.20113865      21.7           0.498861352
14      19.81376359      20.4           0.586236414
15      19.42217211      18.2          -1.222172107
16      19.63108414      19.9           0.268915856

Summary of Errors
                             Training data  Validation data
Total sum of squared errors  6977.106       4251.582211
RMS error                    4.790720883    4.587748542
Average error                3.11245E-07    -0.011138034

RMS Error
• Error = actual − predicted
• RMS error = root-mean-squared error = the square root of the average squared error:
  RMSE = sqrt( Σ (actual − predicted)² / n )
• In the previous example the sizes of the training and validation sets differ, so only the RMS error and the average error (not the total sum of squared errors) are comparable across partitions.

Using Excel and XLMiner for Data Mining
• Excel is limited in data capacity.
• However, the training and validation of DM models can be handled within the modest limits of Excel and XLMiner.
• The models can then be used to score larger databases.
• XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model).
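For readers without XLMiner, here is a rough Python equivalent of the example above: partition the Boston Housing data, fit a multiple linear regression, and compare RMS errors on the training and validation partitions. The file name "BostonHousing.csv" is a hypothetical stand-in for the dataset (assumed to have the columns listed earlier), and the exact numbers will differ from the slides because the random partition differs:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("BostonHousing.csv")        # hypothetical file name
X = df.drop(columns=["MEDV", "CAT.MEDV"])    # predictors
y = df["MEDV"]                               # outcome: median home value

# 60% training / 40% validation, an illustrative partition choice
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

model = LinearRegression().fit(X_train, y_train)

def rmse(actual, predicted):
    # root-mean-squared error: square root of the average squared residual
    return np.sqrt(np.mean((actual - predicted) ** 2))

print("training RMSE:  ", rmse(y_train, model.predict(X_train)))
print("validation RMSE:", rmse(y_valid, model.predict(X_valid)))
```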
Simple Regression Example

Simple Regression Model
• Goal: make a prediction about the starting salary of a current college graduate.
• Given a data set of starting salaries of recent college graduates, compute the average salary and use it as the prediction.
• How certain are we of this prediction? There is variability in the data.

Simple Regression Model
• Use the total variation as an index of uncertainty about our prediction:
  Total variation = Σ (Y − Ȳ)²
• The smaller the total variation, the more accurate (certain) our prediction will be.

Simple Regression Model
• How can we "explain" the variability? Perhaps salary depends on the student's GPA.
[Plot: salary against GPA.]

Simple Regression Model
• Find a linear relationship between GPA and starting salary.
• As GPA increases/decreases, starting salary increases/decreases.

Simple Regression Model
• Least squares method for finding the regression model:
  – Choose a and b in the regression equation Ŷ = a + b·GPA so as to minimize the sum of squared deviations Σ (Y − Ŷ)², where Ŷ ("Y-hat") is the predicted Y value.

Simple Regression Model
• How good is the model?
  – A computer program computed a = 4,779 and b = 5,370.
  – û ("u-hat") is a residual: û = Y − Ŷ.
  – The sum of all the û values is zero.
  – The sum of all the û² values is the variation not explained by the model; this "unexplained variation" is 7,425,726.

Simple Regression Model
• Total variation = 23,000,000

Simple Regression Model
• Total unexplained variation = 7,425,726

Simple Regression Model
• Relative goodness of fit
  – Summarize the improvement in prediction from using the regression model rather than the average.
  – Compute R², the coefficient of determination:
    R² = 1 − (unexplained variation / total variation) = 1 − 7,425,726 / 23,000,000 ≈ 0.68
  – The regression model (equation) is a better predictor than guessing the average salary: GPA is a more accurate predictor of starting salary than the average.
  – R² is the "performance measure" for the model.
  – Predicted starting salary = 4,779 + 5,370 × GPA

Detailed Regression Example

Data Set
Obs #  Salary  GPA  Months Work
1      20000   2.8  48
2      24500   3.4  24
3      23000   3.2  24
4      25000   3.8  24
5      20000   3.2  48
6      22500   3.4  36
7      27500   4.0  20
8      19000   2.6  48
9      24000   3.2  36
10     28500   3.8  12

Scatter Plot – GPA vs. Salary
[Scatter plot of GPA against salary.]

Scatter Plot – Work vs. Salary
[Scatter plot of months worked against salary.]

Pearson Correlation Coefficients (−1 ≤ r ≤ 1)
             Salary    GPA       Months Work
Salary        1
GPA           0.898007  1
Months Work  -0.93927  -0.82993   1

Three Regressions
• Salary = f(GPA)
• Salary = f(Work)
• Salary = f(GPA, Work)
• Interpret the Excel output for each.

Interpreting Results
• Regression statistics
  – Multiple R
  – R²
  – Adjusted R²
  – Standard error
• Statistical significance
  – t test
  – p-value
  – F test

Regression Statistics Table
• Multiple R – the square root of R²
• R² – the coefficient of determination
• Adjusted R² – used if there is more than one x variable
• Standard error – the sample estimate of the standard deviation of the error (actual − predicted)

ANOVA Table
• The ANOVA table gives the F statistic.
• The F test evaluates the claim that there is no significant relationship between any of your independent variables and the dependent variable.
• The "Significance F" value is a p-value.
• Reject the claim of no significant relationship if p < α, where α is generally 0.05.
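The output fields discussed in the last few slides (R², adjusted R², the standard error, and the F statistic with its p-value), along with the coefficient table discussed next, can be reproduced outside Excel. A sketch with statsmodels, using the ten observations from the data set above ("Work" stands in for "Months Work"):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "Salary": [20000, 24500, 23000, 25000, 20000,
               22500, 27500, 19000, 24000, 28500],
    "GPA":    [2.8, 3.4, 3.2, 3.8, 3.2, 3.4, 4.0, 2.6, 3.2, 3.8],
    "Work":   [48, 24, 24, 24, 48, 36, 20, 48, 36, 12],
})

X = sm.add_constant(df[["GPA", "Work"]])  # adds the intercept (b0) column
model = sm.OLS(df["Salary"], X).fit()
print(model.summary())  # R², adjusted R², F and its p-value, per-coefficient t tests
```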
Regression Coefficients Table
• The Coefficients column gives the values b0, b1, b2, …, bn for the regression equation.
  – b0 is the intercept.
  – b1 appears next to independent variable x1, b2 next to x2, b3 next to x3, and so on.

Regression Coefficients Table
• The table also gives p-values for the individual t tests of each independent variable.
• Each t test evaluates the claim that there is no relationship between the independent variable in the corresponding row and the dependent variable.
• Reject the claim of no significant relationship between that independent variable and the dependent variable if p < α.

Salary = f(GPA)

Regression Statistics
Multiple R         0.898006642
R Square           0.806415929
Adjusted R Square  0.78221792
Standard Error     1479.019946
Observations       10

ANOVA
            df  SS        MS        F         Significance F
Regression  1   72900000  72900000  33.32571  0.00041792
Residual    8   17500000  2187500
Total       9   90400000

           Coefficients  Standard Error  t Stat    P-value   Lower 95%    Upper 95%
Intercept  1928.571429   3748.677        0.514467  0.620833  -6715.89326  10573.04
GPA        6428.571429   1113.589        5.772843  0.000418   3860.63173  8996.511

Salary = f(Work)

Regression Statistics
Multiple R         0.939265177
R Square           0.882219073
Adjusted R Square  0.867496457
Standard Error     1153.657002
Observations       10

ANOVA
            df  SS           MS        F         Significance F
Regression  1   79752604.17  79752604  59.92271  5.52993E-05
Residual    8   10647395.83  1330924
Total       9   90400000

             Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
Intercept    30691.66667   1010.136344     30.38369  1.49E-09   28362.28808  33021.0453
Months Work  -227.864583   29.43615619     -7.74098  5.53E-05  -295.7444812  -159.98469

Salary = f(GPA, Work)

Regression Statistics
Multiple R         0.962978985
R Square           0.927328525
Adjusted R Square  0.906565246
Standard Error     968.7621974
Observations       10

ANOVA
            df  SS        MS        F         Significance F
Regression  2   83830499  41915249  44.66195  0.00010346
Residual    7   6569501   938500.2
Total       9   90400000

             Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
Intercept    19135.92896   5608.184        3.412144  0.011255   5874.682112  32397.176
GPA          2725.409836   1307.468        2.084495  0.075582  -366.2602983  5817.08
Months Work  -151.2124317  44.30826        -3.41274  0.011246  -255.9848174  -46.440046

Compare the Three Models

Regression Statistics  f(GPA)       f(Work)      f(GPA, Work)
Multiple R             0.898006642  0.939265177  0.962978985
R Square               0.806415929  0.882219073  0.927328525
Adjusted R Square      0.78221792   0.867496457  0.906565246
Standard Error         1479.019946  1153.657002  968.7621974
Observations           10           10           10
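As a closing sketch, the three-model comparison can also be scripted; with the ten observations from the data set above, the printed statistics should match the Excel output (e.g., R² = 0.806 for Salary = f(GPA)):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "Salary": [20000, 24500, 23000, 25000, 20000,
               22500, 27500, 19000, 24000, 28500],
    "GPA":    [2.8, 3.4, 3.2, 3.8, 3.2, 3.4, 4.0, 2.6, 3.2, 3.8],
    "Work":   [48, 24, 24, 24, 48, 36, 20, 48, 36, 12],
})

# Fit Salary = f(GPA), f(Work), and f(GPA, Work); compare fit statistics
for predictors in (["GPA"], ["Work"], ["GPA", "Work"]):
    X = sm.add_constant(df[predictors])
    fit = sm.OLS(df["Salary"], X).fit()
    print(f"Salary = f({', '.join(predictors)}): "
          f"R2 = {fit.rsquared:.4f}, adj. R2 = {fit.rsquared_adj:.4f}, "
          f"std. error = {fit.mse_resid ** 0.5:.1f}")
```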