Volume I: Discovering Knowledge in Data: An Introduction to Data Mining

Brief Table of Contents

Preface
Chapter 1. An Introduction to Data Mining
Chapter 2. Data Preprocessing
Chapter 3. Exploratory Data Analysis
Chapter 4. Statistical Approaches to Estimation and Prediction
Chapter 5. K-Nearest Neighbor
Chapter 6. Decision Trees
Chapter 7. Neural Networks
Chapter 8. Hierarchical and K-Means Clustering
Chapter 9. Kohonen Networks
Chapter 10. Association Rules
Chapter 11. Model Evaluation Techniques
Epilogue

Detailed Table of Contents

Preface

Chapter 1. An Introduction to Data Mining
  o What is Data Mining?
  o Why Data Mining?
  o The Need for Human Direction of Data Mining
  o The Cross-Industry Standard Process: CRISP-DM
  o Case Study 1: Analyzing Automobile Warranty Claims
  o Fallacies of Data Mining
  o What Tasks Can Data Mining Accomplish?
  o Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
  o Case Study 3: Mining Association Rules from Legal Databases
  o Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
  o Case Study 5: Profiling the Tourism Market Using K-Means Clustering
  o Chapter 1 Bibliography
  o Chapter 1 Exercises

Chapter 2. Data Preprocessing
  o Why Do We Need to Preprocess the Data?
  o Data Cleaning
  o Handling Missing Data
  o Identifying Misclassifications
  o Graphical Methods for Identifying Outliers
  o Data Normalization
  o Min-Max Normalization
  o Z-Score Standardization
  o Numerical Methods for Identifying Outliers
  o Using Z-Scores for Identifying Outliers
  o Robust Detection of Outliers
  o Chapter 2 Bibliography
  o Chapter 2 Exercises
  o Chapter 2 Hands-On Analysis

Chapter 3. Exploratory Data Analysis
  o Hypothesis Testing vs. Exploratory Data Analysis
  o EDA: Getting to Know the Data Set
  o EDA: Dealing with Correlated Variables
  o EDA: Exploring Categorical Variables
  o Using EDA to Uncover Anomalous Fields
  o EDA: Exploring Numeric Variables
  o EDA: Exploring Multivariate Relationships
  o EDA: Selecting Interesting Subsets of the Data for Further Investigation
  o Binning
  o Chapter 3 Bibliography
  o Chapter 3 Exercises
  o Chapter 3 Hands-On Analysis

Chapter 4. Statistical Approaches to Estimation and Prediction
  o The Data Mining Tasks in Discovering Knowledge in Data
  o Statistical Approaches to Estimation and Prediction
  o Univariate Methods: Measures of Center and Spread
  o Statistical Inference
  o How Confident Are We in Our Estimates?
  o Confidence Interval Estimation
  o Simple Linear Regression
  o The Dangers of Extrapolation
  o Confidence Intervals for the Mean Value of y Given x
  o Prediction Intervals for a Randomly Chosen Value of y Given x
  o Multiple Regression
  o Verifying Model Assumptions
  o Chapter 4 Bibliography
  o Chapter 4 Exercises
  o Chapter 4 Hands-On Analysis

Chapter 5. K-Nearest Neighbor
  o Supervised Learning vs. Unsupervised Learning
  o A Methodology for Supervised Modeling
  o The Classification Task
  o The K-Nearest Neighbor Algorithm
  o The Distance Function
  o The Combination Function
  o Weighted Voting
  o Quantifying Attribute Relevance: Stretching the Axes
  o Database Considerations
  o K-Nearest Neighbor for Estimation and Prediction
  o Choosing K
  o Chapter 5 Bibliography
  o Chapter 5 Exercises

Chapter 6. Decision Trees
  o Decision Trees
  o Classification and Regression Trees
  o The C4.5 Algorithm
  o Decision Rules
  o A Comparison of the C5.0 and CART Algorithms Applied to Real Data
  o Chapter 6 Bibliography
  o Chapter 6 Exercises
  o Chapter 6 Hands-On Analysis

Chapter 7. Neural Networks
  o Input and Output Encoding
  o Neural Networks for Estimation and Prediction
  o A Simple Example of a Neural Network
  o The Sigmoid Activation Function
  o Backpropagation
  o The Gradient Descent Method
  o The Backpropagation Rules
  o An Example of Backpropagation
  o Termination Criteria
  o The Learning Rate
  o The Momentum Term
  o Sensitivity Analysis
  o An Application of Neural Network Modeling
  o Chapter 7 Bibliography
  o Chapter 7 Exercises
  o Chapter 7 Hands-On Analysis

Chapter 8. Hierarchical and K-Means Clustering
  o The Clustering Task
  o Hierarchical Clustering Methods
  o K-Means Clustering
  o An Application of K-Means Clustering Using SAS Enterprise Miner
  o Using Cluster Membership to Predict Churn
  o Chapter 8 Bibliography
  o Chapter 8 Exercises
  o Chapter 8 Hands-On Analysis

Chapter 9. Kohonen Networks
  o Self-Organizing Maps
  o Kohonen Networks
  o An Example
  o Cluster Validity
  o An Application of Clustering Using Kohonen Networks
  o Interpreting the Clusters
  o Cluster Profiles
  o Using Cluster Membership as Input to Downstream Data Mining Models
  o Chapter 9 Bibliography
  o Chapter 9 Exercises
  o Chapter 9 Hands-On Analysis

Chapter 10. Data Mining Techniques: Association Rules
  o Affinity Analysis and Market Basket Analysis
  o Data Representation for Market Basket Analysis
  o Support, Confidence, Frequent Itemsets, and the A Priori Property
  o How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
  o How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
  o The Extension from Flag Data to General Categorical Data
  o An Information Theoretic Approach: The Generalized Rule Induction Method
  o The J-Measure
  o An Application of Generalized Rule Induction
  o When Not To Use Association Rules
  o Do Association Rules Represent Supervised or Unsupervised Learning?
  o Local Patterns vs. Global Models
  o Chapter 10 Bibliography
  o Chapter 10 Exercises
  o Chapter 10 Hands-On Analysis

Chapter 11. Model Evaluation Techniques
  o Model Evaluation Techniques for the Description Task
  o Model Evaluation Techniques for the Estimation and Prediction Tasks
  o Model Evaluation Techniques for the Classification Task
  o Error Rate, False Positives, and False Negatives
  o Misclassification Cost Adjustment to Reflect Real-World Concerns
  o Decision Cost / Benefit Analysis
  o Lift Charts and Gains Charts
  o Interweaving Model Evaluation with Model Building
  o Confluence of Results: Applying a Suite of Models
  o Chapter 11 Bibliography
  o Chapter 11 Exercises
  o Chapter 11 Hands-On Analysis

Epilogue. We’ve Only Just Begun: An Invitation to Data Mining Methods and Models