Data Mining Methods and Models
By Daniel T. Larose, Ph.D.
Chapter Summaries and Keywords
Preface
The preface begins by discussing why Data Mining Methods and Models is needed.
Because of the powerful data mining software platforms currently available, a strong
caveat is given against glib application of data mining methods and techniques. In other
words, data mining is easy to do badly. The best way to avoid these costly errors, which
stem from a blind black-box approach to data mining, is to instead apply a “white-box”
methodology, which emphasizes an understanding of the algorithmic and statistical
model structures underlying the software. Data Mining Methods and Models applies
this white-box approach by (1) walking the reader through the operations and nuances of
the various algorithms, using small sample data sets, so that the reader gets a true
appreciation of what is really going on inside the algorithm, (2) providing examples of
the application of the various algorithms on actual large data sets, (3) supplying chapter
exercises, which allow readers to assess their depth of understanding of the material, as
well as have a little fun playing with numbers and data, and (4) providing the reader with
hands-on analysis problems, representing an opportunity for the reader to apply his or her
newly-acquired data mining expertise to solving real problems using large data sets. Data
mining is presented as a well-structured standard process, namely, the Cross-Industry
Standard Process for Data Mining (CRISP-DM). A graphical approach to data analysis
is emphasized, stressing in particular exploratory data analysis. Data Mining Methods
and Models naturally fits the role of textbook for an introductory course in data mining.
Instructors may appreciate (1) the presentation of data mining as a process, (2) the
“white-box” approach, emphasizing an understanding of the underlying algorithmic
structures, (3) the graphical approach, emphasizing exploratory data analysis, and (4) the
logical presentation, flowing naturally from the CRISP-DM standard process and the set
of data mining tasks. Particularly useful for the instructor is the companion website,
which provides ancillary materials for teaching a course using Data Mining Methods and
Models, including PowerPoint® presentations, answer keys, and sample projects. The
book is appropriate for advanced undergraduate or graduate-level courses. No computer
programming or database expertise is required. The software used in the book includes
Clementine, Minitab, SPSS, and WEKA. Free trial versions of Minitab and SPSS are
available for download from their company websites. WEKA is open-source data mining
software freely available for download.
Key Words:
Algorithm walk-throughs, hands-on analysis problems, chapter exercises, “white-box”
approach, data mining as a process, graphical and exploratory approach, companion
website, Clementine, Minitab, SPSS, WEKA.
Chapter 1: Dimension Reduction Methods
Chapter One begins with an assessment of the need for dimension reduction in data
mining. Principal components analysis is demonstrated, in the context of a real-world
example using the Houses data set. Various criteria are compared for determining how
many components should be extracted. Emphasis is given to profiling the principal
components for the end-user, along with the importance of validating the principal
components using the usual hold-out methods in data mining. Next, factor analysis is
introduced and demonstrated using the real-world Adult data set. The need for factor
rotation, which clarifies the definition of the factors, is discussed. Finally, user-defined
composites are briefly discussed, using an example.
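The book works its principal components example in Minitab and SPSS on the Houses data set; the sketch below is an illustrative Python analogue only, assuming the scikit-learn library and a placeholder predictor matrix rather than the book's data.

    # Hypothetical PCA sketch (placeholder data; not the book's Houses example).
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 6)                      # placeholder predictor matrix
    Z = StandardScaler().fit_transform(X)           # standardize before extraction
    pca = PCA().fit(Z)

    print(pca.explained_variance_)                  # eigenvalues (eigenvalue criterion: keep > 1)
    print(pca.explained_variance_ratio_.cumsum())   # proportion-of-variance-explained criterion
    print(pca.components_)                          # component weights, used to profile each component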
Key Words: Principal components, factor analysis, communality, variation, scree plot,
eigenvalues, component weights, factor loadings, factor rotation, user-defined composite.
Chapter 2: Regression Modeling
Chapter Two begins by using an example to introduce simple linear regression and the
concept of least squares. The usefulness of the regression is then measured by the
coefficient of determination r², and the typical prediction error is estimated using the
standard error of the estimate s. The correlation coefficient r is discussed, along with the
ANOVA table for succinct display of results. Outliers, high leverage points, and
influential observations are discussed in detail. Moving from descriptive methods to
inference, the regression model is introduced. The t-Test for the relationship between x
and y is shown, along with the confidence interval for the slope of the regression line, the
confidence interval for the mean value of y given x, and the prediction interval for a
randomly chosen value of y given x. Methods are shown for verifying the assumptions
underlying the regression model. Detailed examples are provided using the Baseball and
California data sets. Finally, methods of applying transformations to achieve linearity are
provided.
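As a minimal worked illustration of least squares and the summary measures named above, the sketch below uses toy numbers (not the Baseball or California data sets) to compute the slope, intercept, r², and standard error of the estimate s.

    # Toy least-squares fit; data values are illustrative only.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    n = len(x)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()                   # least-squares slope and intercept
    y_hat = b0 + b1 * x

    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst                              # coefficient of determination r²
    s = np.sqrt(sse / (n - 2))                      # standard error of the estimate
    print(b0, b1, r2, s)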
Key Words: Simple linear regression, least squares, prediction error, outlier, high
leverage point, influential observation, confidence interval, prediction interval,
transformations.
Chapter 3: Multiple Regression and Model Building
Multiple regression, where more than one predictor variable is used to estimate a
response variable, is introduced by way of an example. To allow for inference, the
multiple regression model is defined, with both model and inferential methods
representing extensions of the simple linear regression case. Next, regression with
categorical predictors (indicator variables) is explained. The problems of
multicollinearity are examined; multicollinearity arises from overly correlated predictors
and results in an unstable response surface. The variance inflation factor is defined, as an aid in
identifying multicollinear predictors. Variable selection methods are then provided,
including forward selection, backward elimination, stepwise, and best-subsets regression.
Mallows’ Cp statistic is defined, as an aid in variable selection. Finally, methods for
using the principal components as predictors in multiple regression are discussed.
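Because the variance inflation factor is central to diagnosing multicollinearity, a brief illustrative sketch may help; it assumes the statsmodels library and a hypothetical predictor matrix, and is not taken from the book. Here VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors.

    # Illustrative variance inflation factors for a hypothetical predictor matrix.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X = sm.add_constant(np.random.rand(200, 4))     # placeholder predictors plus intercept column
    vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
    print(vifs)                                     # values well above 5-10 flag multicollinear predictors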
Key Words: Categorical predictors, indicator variables, multicollinearity, variance
inflation factor, model selection methods, forward selection, backward elimination,
stepwise regression, best-subsets.
Chapter 4: Logistic Regression
Logistic regression is introduced by way of a simple example for predicting the presence
of disease based on age. The maximum likelihood estimation methods for logistic
regression are outlined. Emphasis is placed on interpreting logistic regression output.
Inference within the framework of the logistic regression model is discussed, including
determining whether the predictors are significant. Methods for interpreting the logistic
regression model are examined for dichotomous, polychotomous, and
continuous predictors. The assumption of linearity is discussed, as well as methods for
tackling the zero-cell problem. We then turn to multiple logistic regression, where more
than one predictor is used to classify a response. Methods are discussed for introducing
higher order terms to handle non-linearity. As usual, the logistic regression model must
be validated. Finally, the application of logistic regression using the freely available
software WEKA is demonstrated, using a small example.
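The chapter's own demonstrations use WEKA and other packages; as a hypothetical illustration of maximum likelihood fitting and odds-ratio interpretation, the sketch below assumes the statsmodels library and simulated disease/age data.

    # Illustrative logistic regression of a binary response on one predictor (simulated data).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    age = rng.uniform(20, 80, 200)
    p = 1 / (1 + np.exp(-(-5 + 0.08 * age)))        # hypothetical true model
    disease = rng.binomial(1, p)

    model = sm.Logit(disease, sm.add_constant(age)).fit(disp=0)  # maximum likelihood estimation
    print(model.summary())                          # coefficients, standard errors, significance
    print(np.exp(model.params))                     # exponentiated slope = odds ratio per year of age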
Key Words: Maximum likelihood estimation, categorical response, classification, the
zero-cell problem, multiple logistic regression, WEKA.
Chapter 5: Naïve Bayes and Bayesian Networks
Chapter Five begins by contrasting the Bayesian approach with the usual (frequentist)
approach to probability. The maximum a posteriori (MAP) classification is defined,
which is used to select the preferred response classification. Odds ratios are discussed,
including the posterior odds ratio. The importance of balancing the data is discussed.
Naïve Bayes classification is derived, using a simplifying assumption which greatly
reduces the search space. Methods for handling numeric predictors for Naïve Bayes
classification are demonstrated. An example of using WEKA for Naïve Bayes is
provided. Then, Bayesian Belief Networks (Bayes Nets) are introduced and defined.
Methods for using the Bayesian network to find probabilities are discussed. Finally, an
example of using Bayes nets in WEKA is provided.
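The chapter's examples are run in WEKA; the following is a hypothetical scikit-learn sketch showing how Naïve Bayes with numeric predictors yields posterior probabilities, from which the maximum a posteriori (MAP) class is selected.

    # Illustrative Gaussian Naive Bayes on simulated data (not the book's example).
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    nb = GaussianNB().fit(X, y)
    print(nb.predict_proba(X[:5]))                  # posterior probabilities P(class | predictors)
    print(nb.predict(X[:5]))                        # MAP classification: class with largest posterior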
Key Words: Bayesian approach, maximum a posteriori classification, odds ratio,
posterior odds ratio, balancing the data, Naïve Bayes classification, Bayesian belief
networks, WEKA.
Chapter 6: Genetic Algorithms
Chapter Six begins by introducing genetic algorithms by way of analogy with the
biological processes at work in the evolution of organisms. The basic framework of a
genetic algorithm is provided, including the three basic operators: Selection, Crossover,
and Mutation. A simple example of a genetic algorithm at work is examined, with each
step explained and demonstrated. Next, modifications and enhancements from the
literature are discussed, especially for the selection and crossover operators. Genetic
algorithms for real-valued variables are discussed. The use of genetic algorithms as
optimizers within a neural network is demonstrated, where the genetic algorithm replaces
the usual backpropagation algorithm. Finally, an example of the use of WEKA for
genetic algorithms is provided.
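As a hypothetical illustration of the three basic operators, the toy sketch below maximizes the number of 1-bits in a chromosome; the operator choices and parameter values are illustrative, not the chapter's.

    # Toy genetic algorithm: maximize the count of 1-bits in a 20-bit chromosome.
    import random

    def fitness(chrom):
        return sum(chrom)

    def select(pop):                                # fitness-proportional (roulette-wheel) selection
        return random.choices(pop, weights=[fitness(c) + 1 for c in pop], k=2)

    def crossover(p1, p2):                          # single-point crossover
        point = random.randrange(1, len(p1))
        return p1[:point] + p2[point:]

    def mutate(chrom, rate=0.01):                   # bit-flip mutation
        return [1 - g if random.random() < rate else g for g in chrom]

    pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
    for _ in range(50):                             # generations
        pop = [mutate(crossover(*select(pop))) for _ in range(len(pop))]
    print(max(fitness(c) for c in pop))             # best fitness found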
Key Words: Selection, crossover, mutation, optimization, global optimum, selection
pressure, crowding, fitness, WEKA.
Chapter 7: Case Study: Modeling Response to Direct Mail Marketing
The case study begins with an overview of the cross-industry standard process for data
mining: CRISP-DM. For the business understanding phase, the direct mail marketing
response problem is defined, with particular emphasis on the construction of an accurate
cost/benefit table, which will be used to assess the usefulness of all later models. In the
data understanding and data preparation phases, the Clothing Store data set is explored.
Transformations to achieve normality or symmetry are applied, as are standardization and
the construction of flag variables. Useful new variables are derived. The relationships
between the predictors and the response are explored, and the correlation structure among
the predictors is investigated. Next comes the modeling phase. Here, two principal
components are derived, using principal components analysis. Clustering analysis is
performed, using the BIRCH clustering algorithm. Emphasis is placed on the effects of
balancing (and over-balancing) the training data set. The baseline model performance is
established. Two sets of models are examined, Collection A, which uses the principal
components, and Collection B, which does not. The technique of using over-balancing as
a surrogate for misclassification costs is applied. The method of combining models via
voting is demonstrated, as is the method of combining models using the mean response
probabilities.
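The two combination ideas can be illustrated with a small hypothetical sketch; the predicted response probabilities below are invented for illustration and do not come from the case study's models.

    # Combining three models' predicted "respond" probabilities for five customers.
    import numpy as np

    probs = np.array([[0.7, 0.4, 0.9, 0.2, 0.6],    # model A (hypothetical)
                      [0.6, 0.5, 0.8, 0.3, 0.4],    # model B (hypothetical)
                      [0.8, 0.3, 0.7, 0.1, 0.5]])   # model C (hypothetical)

    votes = (probs >= 0.5).sum(axis=0)              # each model casts a respond / do-not-respond vote
    by_voting = (votes >= 2).astype(int)            # combination via majority voting
    by_mean = (probs.mean(axis=0) >= 0.5).astype(int)  # combination via mean response probabilities
    print(by_voting, by_mean)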
Key Words: CRISP-DM standard process for data mining, BIRCH clustering algorithm,
over-balancing, misclassification costs, cost/benefit analysis, model combination,
voting, mean response probabilities.