Chapter 2: Overview of the Data Mining Process

Introduction
• Data mining – predictive analysis
  – Tasks of classification and prediction
  – The core of business intelligence
• Database methods
  – OLAP
  – SQL
  – Do not involve statistical modeling

Core Ideas in Data Mining
• Analytical methods used in predictive analytics
  – Classification
    • Used with categorical response variables
    • E.g., will a purchase be made / not made?
  – Prediction
    • Predict (estimate) the value of a continuous response variable
    • Prediction is used with categorical response variables as well
  – Association rules
    • Affinity analysis – "what goes with what"
    • Seeks correlations among data
• Data reduction
  – Reduce the number of variables
  – Group together similar variables
• Data exploration
  – View data as evidence
  – Get "a feel" for the data
• Data visualization
  – Graphical representation of data
  – Locate trends, correlations, etc.

Supervised Learning
• "Supervised learning" algorithms are those used in classification and prediction.
  – Data are available in which the value of the outcome of interest is known.
• "Training data" are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable.
• This process results in a model.
  – Classification model
  – Predictive model
• The model is then run on another sample of data, the "validation data," where the outcome is known, to see how well the model performs.
  – If many different models are being tried out, a third sample of known outcomes, the "test data," is used with the final, selected model to estimate how well it will do.
• The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown.
• Linear regression analysis is an example of supervised learning.
  – The Y variable is the (known) outcome variable; the X variable is a predictor variable.
  – A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by the line.
  – The regression line can then be used to predict Y values for new values of X for which we do not know the Y value.

Unsupervised Learning
• No outcome variable to predict or classify
• No "learning" from cases where the outcome is known
• Unsupervised learning methods
  – Association rules
  – Data reduction methods
  – Clustering techniques

The Steps in Data Mining
• 1. Develop an understanding of the purpose of the data mining project.
  – Is it a one-shot effort to answer a question or questions, or an ongoing application?
• 2. Obtain the dataset to be used in the analysis.
  – Random sampling from a large database to capture the records to be used in the analysis
  – Pulling together data from different databases
    • Internal (e.g., past purchases made by customers)
    • External (e.g., credit ratings)
  – Usually the analysis requires only thousands or tens of thousands of records.
• 3. Explore, clean, and preprocess the data (see the sketch below).
  – Verify that the data are in reasonable condition.
  – How should missing data be handled?
  – Are the values in a reasonable range, given what you would expect for each variable?
  – Are there obvious outliers?
  – Review the data graphically, e.g., a matrix of scatter plots showing the relationship of each variable with every other variable.
  – Ensure consistency in the definitions of fields, units of measurement, time periods, etc.
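A minimal sketch of these Step 3 checks in Python with pandas. The file name customers.csv, the "age" column, and its 0–120 bounds are illustrative assumptions, not part of the slides.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical input file: records in rows, variables in columns.
df = pd.read_csv("customers.csv")

# Are the values in a reasonable range for each variable?
print(df.describe())

# How much missing data is there, per variable?
print(df.isna().sum())

# Example range check (column name and bounds are assumed for illustration).
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]
print(len(out_of_range), "records with out-of-range age")

# Review the data graphically: a matrix of scatter plots showing the
# relationship of each numeric variable with every other one.
scatter_matrix(df.select_dtypes("number"), figsize=(8, 8))
plt.show()
```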
The Steps in Data Mining (continued)
• 4. Reduce the data, and if supervised learning is involved, separate them into training, validation, and test datasets.
  – Eliminate unneeded variables.
  – Transform variables, e.g., turning "money spent" into "spent > $100" vs. "spent ≤ $100".
  – Create new variables, e.g., a variable that records whether at least one of several products was purchased.
  – Make sure you know what each variable means, and whether it is sensible to include it in the model.
• 5. Determine the data mining task.
  – Classification, prediction, clustering, etc.
• 6. Choose the data mining techniques to be used.
  – Regression, neural nets, hierarchical clustering, etc.
• 7. Use algorithms to perform the task.
  – An iterative process – trying multiple variants, and often multiple variants of the same algorithm (choosing different variables or settings within the algorithm).
  – When appropriate, feedback from the algorithm's performance on validation data is used to refine the settings.
• 8. Interpret the results of the algorithms.
  – Choose the best algorithm to deploy.
  – Use the final choice on the test data to get an idea of how well it will perform.
• 9. Deploy the model.
  – Integrate the model into operational systems.
  – Run it on real records to produce decisions or actions.
  – For example, the model might be applied to a purchased list of possible customers, and the action might be "include in the mailing if the predicted amount of purchase is > $10."

Preliminary Steps
• Organization of datasets
  – Records in rows, variables in columns
  – In supervised learning one of the columns is the outcome variable, usually placed first or last
• Sampling from a database
  – Use samples to build, validate, and test the model
• Oversampling rare events
  – If a value of the response variable is seldom found in the data, increase its share of the sample
  – Adjust the algorithm's results for the oversampling as necessary

Preliminary Steps (Pre-processing and Cleaning the Data)
• Types of variables
  – Continuous – assumes any real numerical value (generally within a specified range)
  – Categorical – assumes one of a limited number of values
    • Text, e.g., payments ∈ {current, not current, bankrupt}
    • Numerical, e.g., age ∈ {0 … 120}
    • Nominal (e.g., payments) – unordered categories
    • Ordinal (e.g., age) – ordered categories
• Handling categorical variables
  – If a categorical variable is ordered, it can be used as a continuous variable (e.g., age, level of credit).
  – Use "dummy" variables when the range of values is not large (see the sketch below).
    • E.g., the variable occupation ∈ {student, unemployed, employed, retired}
    • Create binary (yes/no) dummy variables:
      – Student – yes/no
      – Unemployed – yes/no
      – Employed – yes/no
      – Retired – yes/no
• Variable selection
  – The more predictor variables, the more records are needed to build the model.
  – Reduce the number of variables whenever appropriate.
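A minimal sketch of the dummy-variable step in Python with pandas. The occupation categories follow the slide's example; the toy records are made up for illustration.

```python
import pandas as pd

# Toy data using the slide's example:
# occupation ∈ {student, unemployed, employed, retired}
df = pd.DataFrame({"occupation": ["student", "employed", "retired", "unemployed"]})

# One binary (yes/no) dummy variable per category.
dummies = pd.get_dummies(df["occupation"], prefix="occupation")
df = pd.concat([df.drop(columns="occupation"), dummies], axis=1)
print(df)
# Columns: occupation_employed, occupation_retired,
#          occupation_student, occupation_unemployed
```

For regression-type models, one dummy is usually dropped (pd.get_dummies(..., drop_first=True)), since the omitted category is implied by the others and keeping all of them creates perfect collinearity.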
Preliminary Steps (Pre-processing and Cleaning the Data)
• Overfitting
  – We build a model to describe relationships among variables in order to predict future outcome (dependent) values on the basis of future predictor (independent) values.
  – Avoid "explaining" variation in the data that is nothing more than chance variation; avoid mislabeling "noise" in the data as if it were a "signal."
  – Caution: if the dataset is not much larger than the number of predictor variables, it is very likely that a spurious relationship will creep into the model.

[Figure: illustration of overfitting]

• How many variables and how much data?
  – A good rule of thumb is to have ten records for every predictor variable.
  – For classification procedures: at least 6 × m × p records, where m = the number of outcome classes and p = the number of variables.
  – Compactness (parsimony) is a desirable feature in a model.
  – A matrix of X-Y plots can be useful in variable selection: it shows at a glance the X-Y plots for all variable combinations.
    • A straight line would indicate that one variable is exactly correlated with another; we would want to include only one of them in the model.
  – Weed out irrelevant and redundant variables from the model.
  – Consult a domain expert whenever possible.
• Outliers
  – Values that lie far away from the bulk of the data are called outliers.
  – No statistical rule can tell us whether such an outlier is the result of an error; these judgments are best made by someone with "domain" knowledge.
  – If the number of records with outliers is very small, they might be treated as missing data.
• Missing values
  – If the number of records with missing values is small, those records might be omitted.
  – The more variables there are, the more records will be dropped.
  – One solution: for a variable with missing data, substitute the average value computed from the records with valid data. Note that this reduces the variability in the dataset.
  – Human judgment can be used to determine the best way to handle missing data.
• Normalizing (standardizing) the data
  – To normalize the data, subtract the mean from each value and divide by the standard deviation of the resulting deviations from the mean.
    • This expresses each value as the "number of standard deviations away from the mean" – the z-score.
  – Needed if variables are in different units, e.g., hours, thousands of dollars, etc.
  – Clustering algorithms measure variable values by their distance from each other, so they need a standard scale for distance.
  – Data mining software, including XLMiner, typically has an option that normalizes the data in those algorithms where it may be required.

Preliminary Steps
• Use and creation of partitions (see the sketch below)
  – Training partition
    • The largest partition; contains the data used to build the various models.
    • The same training partition is generally used to develop multiple models.
  – Validation partition
    • Used to assess the performance of each model, to compare models, and to pick the best one.
    • In classification and regression tree algorithms, the validation partition may be used automatically to tune and improve the model.
  – Test partition
    • Sometimes called the "holdout" or "evaluation" partition; used to assess the performance of the chosen model with new data.

[Figure: the three data partitions and their role in the data mining process]
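A minimal sketch of z-score normalization and a three-way partition in Python. The toy columns and the 60/20/20 split are assumptions for illustration; the slides do not prescribe particular proportions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy numeric dataset in different units (values are made up).
df = pd.DataFrame({"hours": rng.integers(10, 60, 100),
                   "dollars_k": rng.uniform(0.5, 9.5, 100)})

# z-score: subtract the mean, divide by the standard deviation, so each
# value becomes a "number of standard deviations away from the mean".
z = (df - df.mean()) / df.std()

# Shuffle the record indices, then cut into training / validation / test
# partitions (60/20/20 is a common choice, assumed here).
idx = rng.permutation(len(z))
n_train, n_valid = int(0.6 * len(z)), int(0.2 * len(z))
train = z.iloc[idx[:n_train]]
valid = z.iloc[idx[n_train:n_train + n_valid]]
test = z.iloc[idx[n_train + n_valid:]]
print(len(train), len(valid), len(test))  # 60 20 20
```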
Simple Regression Example

Simple Regression Model
• Goal: make a prediction about the starting salary of a current college graduate, using a dataset of starting salaries of recent college graduates.
  – A first prediction: compute the average salary.
  – How certain are we of this prediction? There is variability in the data.
• Use total variation as an index of uncertainty about our prediction.
  – The smaller the total variation, the more accurate (certain) our prediction will be.
• How can we "explain" the variability? Perhaps salary depends on the student's GPA.

[Figure: scatter plot of starting salary vs. GPA]

• Find a linear relationship between GPA and starting salary.
  – As GPA increases/decreases, starting salary increases/decreases.
• Use the least squares method to find the regression model.
  – Choose a and b in the regression equation Y-hat = a + b·X so as to minimize the sum of squared deviations between the actual Y values and the predicted values (Y-hat).
• How good is the model?
  – A computer program computed the values a = 4,779 and b = 5,370.
  – u-hat = Y − Y-hat is a "residual" value; the sum of all u-hats is zero.
  – The sum of all squared u-hats is the variation not explained by the model.
  – Total variation = 23,000,000; total unexplained variation = 7,425,726.
• Relative goodness of fit
  – Summarize the improvement in prediction from using the regression model by computing R², the coefficient of determination:
    R² = 1 − (unexplained variation / total variation) = 1 − 7,425,726 / 23,000,000 ≈ 0.68
  – The regression equation is a better predictor than guessing the average salary: GPA is a more accurate predictor of starting salary than the average.
  – R² is the "performance measure" for the model.
  – Predicted starting salary = 4,779 + 5,370 × GPA

Building a Model - An Example with Linear Regression
• A worked least-squares sketch follows below.

Problems
• Problem 2.11, page 33
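To tie the example together, here is a minimal least-squares sketch in Python with NumPy. The GPA/salary values are made up for illustration; the slides' fitted values (a = 4,779, b = 5,370) come from the course dataset, which is not reproduced here, so the numbers printed below will differ.

```python
import numpy as np

# Hypothetical sample: GPA (X) and starting salary (Y).
gpa = np.array([2.1, 2.5, 2.8, 3.0, 3.2, 3.6, 3.9])
salary = np.array([15000., 17500., 19000., 20500., 21500., 24000., 26000.])

# Least squares estimates:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a = y_bar - b * x_bar
x_dev = gpa - gpa.mean()
b = np.sum(x_dev * (salary - salary.mean())) / np.sum(x_dev ** 2)
a = salary.mean() - b * gpa.mean()

pred = a + b * gpa        # Y-hat, the fitted values
resid = salary - pred     # u-hat, the residuals (they sum to ~0)

sse = np.sum(resid ** 2)                     # unexplained variation
sst = np.sum((salary - salary.mean()) ** 2)  # total variation
r2 = 1 - sse / sst                           # coefficient of determination

print(f"salary-hat = {a:,.0f} + {b:,.0f} * GPA, R^2 = {r2:.3f}")
```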