The CRISP Data Mining Process
August 28, 2004 Data Mining

The Data Mining Process
Phases, arranged around the data:
- Business understanding
- Data evaluation
- Data preparation
- Modeling
- Evaluation
- Deployment

Business Understanding
- Project objectives
- Project requirements
- DM problem formulation
- Preliminary plan

Case Study
- Data mining project done for a large insurance company
- Consider the use of data mining to improve understanding of customer databases
- Led by the data warehousing team, which also wanted to improve its expertise

Business Objectives
- Understand what coverage packages are of interest to a customer group
    - Targeting of new customers
    - Cross-selling opportunities to existing customers
- Understand why a customer group terminates coverage
    - Know in advance what groups are likely to terminate
    - Understand what factors influence termination

What are the Goals?
- The business goals
    - Improve customer retention
    - Increase cross-selling
- Success criteria
    - Customer turnover rate
    - Amount of cross-selling

Data Mining Problems
- Classify new and existing customers as either interested or not interested in a particular coverage
- Classify existing customers as either likely or unlikely to terminate coverage

Data Evaluation
- Initial data collections
- Data quality
- Data warehousing team
- Initial insights
- Interesting subsets

Case Study: Data Evaluation
- Data was extracted from select customer databases by company personnel
- Coverage programs with few customers were selected for the pilot project
- Five separate files were extracted for five coverage programs

Data Preparation
- Raw data -> finished data set
- Technical tasks:
    - Data selection
    - Attribute selection
    - Data cleaning

Case Study: Data Preparation
- Some initial formatting of data in MS Excel
- Cleaning of data file
- Combine headers/instances
- Add a new attribute: interest (yes/no)
    - Must create the no-interest cases
- End up with a CSV-formatted file

Weka Data Mining Software
Data in CSV format loaded into Weka:
- Data preprocessing
- Attribute selection
- Modeling
    - Classification
    - Clustering
    - Association rule mining
- Visualization

Data Preprocessing in Weka
- Initial data inspection
    - Missing values
    - Useless attributes
    - Numeric attributes as nominal
- Some helpful Weka filters
    - RemoveUseless
    - ReplaceMissingValues

Data Preprocessing in Weka
Data reduction:
- Instance dimension
    - RemovePercentage and Resample filters
- Attribute dimension
    - Remove redundant attributes
    - Remove irrelevant attributes
    - Identify most important attributes

Attribute Selection Methods
- Three main methods used:
    - InfoGain
    - ChiSquared
    - Relief
- Combined results from complementary methods
- Final pruning of attribute list to twenty attributes

Selected Attributes
- Location
    - Tax State
    - Contract State
    - State Code
    - Zip Code
- Size
    - Case Size Range
- Industry
    - Industry Classification
    - Industry Classification Name
    - SIC Code
- Timing
    - New Sale Flag
    - Decision Maker Effective Month
    - Decision Maker Effective Year
    - Next Renewal Month
    - Next Renewal Year
- Internal
    - Agency Number
    - Office Name
    - Pricing Category Code
    - Product Line Name
    - Small Group Flag

Relevance of Attribute Selection
- Improved modeling
    - Faster model induction
    - Higher accuracy
    - Easier to interpret models
- Structural knowledge gained from the selection of attributes

Most Important Attributes
- What attributes affect the purchasing decision of a customer group?
- E.g., the five most important factors that determine whether a customer group purchases a particular insurance coverage:
    - Agency Number
    - Small Group Flag
    - Zip Code
    - Decision Maker Effective Year
    - Next Renewal Month

Customer Segmentation
- Unique groups of customers
    - Similar characteristics
    - Similar behavior in terms of interest in coverage
- For example, separate predictive models for customer segments for a particular type of insurance

Customer Segments Used for Modeling
- Results
    - Three segments for one database
    - Two segments for two databases
    - One segment for two databases
- Continue modeling for each segment independently

Modeling
- Select modeling technique(s)
- Calibrate modeling techniques
- Make adjustments to data

Modeling
- Mathematical models for predicting if a customer is interested in a coverage
- Understand why a customer is interested
- For example: if a customer’s state is Indiana and the office is Indianapolis_Office1, then the customer is interested in Coverage_3

Modeling Techniques
- Three modeling techniques tried for predicting customer interest:
    - Decision trees
    - Artificial neural networks (ANN)
    - Support vector machines (SVM)
- Decision trees have the advantage of transparency
- ANN and SVM did not have significantly better prediction accuracy

Insurance Coverage Interest (Type 6)
Small Group Flag
    = Y: No
    = N: Product Line Name
        = Group_1: Yes
        = Group_2: No

Insurance Coverage Interest (Type 7)
[Decision tree figure; the branch structure is not recoverable from the extracted text. Split attributes shown: Pricing Category Code (A2, A4, Others), Next Renewal Year (<= 2002 / > 2002 and <= 2000 / > 2000), Industry Classification Name (Transportation_and_Public_Utilities, Legal_Services), Agency Number (<= 430 / > 430), and Group_1 / Group_2. Leaves are Yes / No. Some branches omitted.]

Accuracy of Predicting Customer Interest
Coverage    Accuracy
Type 1      84.0%
Type 2      97.2%
Type 3      98.3%
Type 4      99.5%
Type 5      88.4%
Type 6      100%
Type 7      76.3%
Type 8      85.0%
Type 9      94.8%

Modeling
- Mathematical models for predicting if a customer will terminate coverage
- Why do customers terminate a specific type of coverage?
- What are the important factors in a customer’s decision to terminate coverage?

Who Terminates Type 3 Coverage?
Correct for 95% of customers
[Decision tree figure; the branch structure is not recoverable from the extracted text. Split attributes shown: Customer Effective Year, Coverage Effective Year, Next Renewal Month. Split values shown: 1999, 2000, 2001, 2002, and 7. Leaves are Terminated / Active.]

Who Terminates Type 1 Coverage?
- Decision tree based on:
    - Distribution number
    - Underwriting department number
    - Price category
    - Rate type
    - Rate Plan Year
- Predicts 96.3% of terminations correctly

Accuracy of Predicting Termination
Model       Accuracy
Type 1      96.3%
Type 2      96.5%
Type 3      95.3%
Type 4      88.9%
Type 5      88.3%

Evaluation
- Data analysis results in a good model
- Are business objectives being achieved?
- Is there an important business issue that has not been considered?
- Should the results be used?

Deployment
- Incorporate the results in the organization’s decision making process
    - Report
    - Decision support system
    - Personalization of web pages
- Repeatable data mining process
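
Sketch: the Type 6 tree as code
The transparency advantage claimed for decision trees can be illustrated by writing the Insurance Coverage Interest (Type 6) tree as a plain rule function. The following is a minimal sketch, not the project's actual implementation (the models in the case study were built in Weka). The function names, the dictionary-based record layout, and the sample records are assumptions made for illustration; only the attribute names, their values, and the Yes/No outcomes come from the slide.

```python
from typing import Mapping, Sequence


def predict_type6_interest(customer: Mapping[str, str]) -> str:
    """Return "Yes" or "No" by following the Type 6 tree from the slide.

    The record layout (a mapping keyed by attribute name) is an assumption.
    """
    flag = customer["Small Group Flag"]
    if flag == "Y":
        return "No"
    if flag != "N":
        raise ValueError(f"Small Group Flag value not on the slide: {flag!r}")

    # Small Group Flag is N: the tree splits on Product Line Name.
    product_line = customer["Product Line Name"]
    if product_line == "Group_1":
        return "Yes"
    if product_line == "Group_2":
        return "No"
    # The slide shows only Group_1 and Group_2, so anything else is rejected
    # rather than guessed.
    raise ValueError(f"Product Line Name value not on the slide: {product_line!r}")


def accuracy(records: Sequence[Mapping[str, str]], labels: Sequence[str]) -> float:
    """Fraction of records whose predicted interest matches the given label."""
    if len(records) != len(labels) or not labels:
        raise ValueError("records and labels must be non-empty and equal in length")
    correct = sum(predict_type6_interest(r) == y for r, y in zip(records, labels))
    return correct / len(labels)


# Hypothetical records, invented for illustration only (not project data).
SAMPLE_RECORDS = [
    {"Small Group Flag": "Y", "Product Line Name": "Group_1"},
    {"Small Group Flag": "N", "Product Line Name": "Group_1"},
    {"Small Group Flag": "N", "Product Line Name": "Group_2"},
]
SAMPLE_LABELS = ["No", "Yes", "Yes"]

if __name__ == "__main__":
    print(accuracy(SAMPLE_RECORDS, SAMPLE_LABELS))
```

Each branch of the function corresponds to one path in the tree, which is why a business user can check the model by reading it. The accuracy helper computes the same kind of figure reported in the accuracy tables, here on made-up records where one label deliberately disagrees with the tree.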