The CRISP Data Mining Process

The Data Mining Process
[Figure: process cycle around the data: Business understanding, Data evaluation, Data preparation, Modeling, Evaluation, Deployment]
August 28, 2004
Data Mining
2
Business Understanding
Project objectives
Project requirements
DM problem formulation
Preliminary plan

Case Study
Data mining project done for a large insurance company
Consider the use of data mining to improve understanding of customer databases
Led by the data warehousing team, which also wanted to improve its expertise

Business Objectives
Understand what coverage packages are of interest to a customer group
  - Targeting of new customers
  - Cross-selling opportunities to existing customers
Understand why a customer group terminates coverage
  - Know in advance what groups are likely to terminate
  - Understand what factors influence termination

What are the Goals?
The business goals
  - Improve customer retention
  - Increase cross-selling
Success criteria
  - Customer turnover rate
  - Amount of cross-selling

Data Mining Problems
Classify new and existing customers as either interested or not interested in a particular coverage
Classify existing customers as either likely or unlikely to terminate coverage

Data Evaluation
Initial data collections
Data quality
Data warehousing team
Initial insights
Interesting subsets
Case Study: Data Evaluation
Data was extracted from select customer databases by company personnel
Coverage programs with few customers selected for pilot project
Five separate files extracted for five coverage programs

Data Preparation
Raw Data -> Finished Data Set
Technical tasks:
  - Data selection
  - Attribute selection
  - Data cleaning

Case Study: Data Preparation
Some initial formatting of data in MS Excel
Cleaning of data file
  - Combine headers/instances
  - Add a new attribute: interest (yes/no)
  - Must create the no interest cases
End up with a CSV formatted file

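The slides do not say how the no interest cases were created. One plausible reading, sketched here purely as an assumption, is that each extract lists only the customer groups holding that coverage, so groups that appear in another extract but not in this one are labelled "no". The column names (group_id, zip_code) and the sample rows are hypothetical.

```python
import csv
import io

def label_interest(target_rows, other_rows, key="group_id"):
    """Return rows with an added 'interest' attribute (yes/no).

    Assumption: target_rows holds groups that bought the coverage; groups
    found only in other_rows are treated as the 'no interest' cases.
    """
    labelled = [dict(row, interest="yes") for row in target_rows]
    seen = {row[key] for row in target_rows}
    for row in other_rows:
        if row[key] not in seen:
            labelled.append(dict(row, interest="no"))
            seen.add(row[key])
    return labelled

def to_csv(rows):
    """Serialize rows to CSV text with a single header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample extracts for two coverage programs
coverage_a = [{"group_id": "G1", "zip_code": "46201"}]
coverage_b = [{"group_id": "G1", "zip_code": "46201"},
              {"group_id": "G2", "zip_code": "47401"}]
csv_text = to_csv(label_interest(coverage_a, coverage_b))
```

The resulting text has one header line followed by one instance per line, which is the shape a CSV loader expects.
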
Weka Data Mining Software
Data in CSV format loaded into Weka:
  - Data preprocessing
  - Attribute selection
  - Modeling
      - Classification
      - Clustering
      - Association rule mining
  - Visualization

Data Preprocessing in Weka
Initial data inspection
  - Missing values
  - Useless attributes
  - Numeric attributes as nominal
Some helpful Weka filters
  - RemoveUseless
  - ReplaceMissingValues
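
As a rough illustration of what these two filters do, the sketch below mimics their core behaviour in plain Python. It is not Weka's implementation. It assumes nominal attributes only and missing values marked with "?", and it drops only constant attributes, whereas RemoveUseless can also drop nominal attributes that vary too much. The sample rows are hypothetical.

```python
from collections import Counter

MISSING = "?"  # assumed missing-value marker

def replace_missing_with_mode(rows):
    """Fill missing nominal values with each attribute's most frequent value."""
    modes = {}
    for name in rows[0]:
        observed = [row[name] for row in rows if row[name] != MISSING]
        # Leave the marker in place if the attribute is missing everywhere.
        modes[name] = Counter(observed).most_common(1)[0][0] if observed else MISSING
    return [{name: (modes[name] if value == MISSING else value)
             for name, value in row.items()} for row in rows]

def remove_constant_attributes(rows):
    """Drop attributes that take a single value across all instances."""
    keep = [name for name in rows[0] if len({row[name] for row in rows}) > 1]
    return [{name: row[name] for name in keep} for row in rows]

# Hypothetical sample instances
rows = [
    {"state": "IN", "country": "US", "small_group": "Y"},
    {"state": "?", "country": "US", "small_group": "N"},
    {"state": "IN", "country": "US", "small_group": "Y"},
    {"state": "OH", "country": "US", "small_group": "N"},
]
cleaned = remove_constant_attributes(replace_missing_with_mode(rows))
```

Here the constant country attribute is dropped and the missing state is filled with the most frequent state.
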

Data Preprocessing in Weka
Data reduction:
  - Instance dimension
      - RemovePercentage and Resample filters
  - Attribute dimension
      - Remove redundant attributes
      - Remove irrelevant attributes
      - Identify most important attributes
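
Reduction along the instance dimension can be pictured with the following sketch. It imitates the idea behind the two filters rather than reproducing Weka's code: the first function is assumed to drop the leading share of the instances, and the second draws a random subsample with or without replacement. Function names, parameters, and the seed are illustrative assumptions.

```python
import random

def remove_percentage(rows, percentage):
    """Drop the first `percentage` percent of the instances."""
    cut = round(len(rows) * percentage / 100)
    return rows[cut:]

def resample(rows, sample_percent, seed=1, replacement=True):
    """Draw a random subsample sized as a percentage of the original data."""
    rng = random.Random(seed)  # fixed seed keeps the subsample repeatable
    size = round(len(rows) * sample_percent / 100)
    if replacement:
        return [rng.choice(rows) for _ in range(size)]
    return rng.sample(rows, size)

instances = list(range(10))  # stand-ins for ten data rows
kept = remove_percentage(instances, 30)
subsample = resample(instances, 50, replacement=False)
```
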

Attribute Selection Methods
Three main methods used:
  - InfoGain
  - ChiSquared
  - Relief
Combined results from complementary methods
Final pruning of attribute list to twenty attributes
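
To make the first of these methods concrete, the sketch below computes information gain for nominal attributes: the entropy of the class minus the weighted entropy that remains after splitting on the attribute. It is a minimal illustration, not the code used in the project. The attribute names and the four sample instances are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical data: the class is the interest attribute (yes/no)
interest = ["no", "no", "yes", "yes"]
attributes = {
    "small_group_flag": ["Y", "Y", "N", "N"],  # separates the classes perfectly
    "renewal_month": ["1", "7", "1", "7"],     # carries no information here
}
ranking = sorted(attributes,
                 key=lambda name: info_gain(attributes[name], interest),
                 reverse=True)
```

Ranking attributes by this score and keeping the top entries is one way to prune an attribute list. How the project combined the three methods is not described in the slides.
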
Selected Attributes
Location
  - Tax State
  - Contract State
  - State Code
  - Zip Code

Selected Attributes
Size
  - Case Size Range
Industry
  - Industry Classification
  - Industry Classification Name
  - SIC Code

Selected Attributes
Timing
  - New Sale Flag
  - Decision Maker Effective Month
  - Decision Maker Effective Year
  - Next Renewal Month
  - Next Renewal Year

Selected Attributes
Internal
  - Agency Number
  - Office Name
  - Pricing Category Code
  - Product Line Name
  - Small Group Flag

Relevance of Attribute Selection
Improved modeling
  - Faster model induction
  - Higher accuracy
  - Easier to interpret models
Structural knowledge gained from the selection of attributes

Most Important Attributes
What attributes affect the purchasing decision of a customer group?
E.g., the five most important factors that determine whether a customer group purchases a particular insurance coverage:
  - Agency Number
  - Small Group Flag
  - Zip Code
  - Decision Maker Effective Year
  - Next Renewal Month

Customer Segmentation
Unique groups of customers
  - Similar characteristics
  - Similar behavior in terms of interest in coverage
For example, separate predictive models for customer segments for a particular type of insurance

Customer Segments Used for Modeling
Results
  - Three segments for one database
  - Two segments for two databases
  - One segment for two databases
Continue modeling for each segment independently

Modeling
Select modeling technique(s)
Calibrate modeling techniques
Make adjustments to data
Modeling
Mathematical models for predicting if a customer is interested in a coverage
Understand why a customer is interested
For example:
  If a customer’s state is Indiana and the office is Indianapolis_Office1 then the customer is interested in Coverage_3
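
A rule of this kind maps directly onto a small predicate. The sketch below encodes the example rule only. The dictionary keys (state, office) are hypothetical field names, not the project's actual column names.

```python
def interested_in_coverage_3(customer):
    """Example rule: state is Indiana and office is Indianapolis_Office1."""
    return (customer.get("state") == "Indiana"
            and customer.get("office") == "Indianapolis_Office1")

# Hypothetical customer records
matching = {"state": "Indiana", "office": "Indianapolis_Office1"}
other = {"state": "Indiana", "office": "Fort_Wayne_Office2"}
rule_fires = interested_in_coverage_3(matching)
```

A False result means only that this rule does not fire. It does not by itself say the customer is uninterested, since other rules in the model may still apply.
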
Modeling Techniques
Three modeling techniques tried for predicting customer interest:
  - Decision trees
  - Artificial neural networks (ANN)
  - Support vector machines (SVM)
Decision trees have the advantage of transparency
ANN and SVM did not have significantly better prediction accuracy

Insurance Coverage Interest (Type 6)
[Figure: decision tree]
Small Group Flag = Y: No
Small Group Flag = N:
  Product Line Name = Group_1: Yes
  Product Line Name = Group_2: No
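
The transparency of a decision tree shows in how easily it turns into code. The sketch below encodes one reading of the Type 6 figure: Small Group Flag Y leads to No, and otherwise Product Line Name decides, with Group_1 leading to Yes and Group_2 to No. That reading of the figure is an assumption, as are the function and parameter names.

```python
def predict_type6_interest(small_group_flag, product_line_name):
    """Walk the assumed Type 6 tree and return 'Yes' or 'No'."""
    if small_group_flag == "Y":
        return "No"
    if small_group_flag != "N":
        raise ValueError(f"unexpected Small Group Flag: {small_group_flag!r}")
    if product_line_name == "Group_1":
        return "Yes"
    if product_line_name == "Group_2":
        return "No"
    # The figure shows only two product line branches.
    raise ValueError(f"unexpected Product Line Name: {product_line_name!r}")

prediction = predict_type6_interest("N", "Group_1")
```
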
Insurance Coverage Interest (Type 7)
[Figure: decision tree, some branches omitted. Splits on Pricing Category Code (A2, A4, Others), Industry Classification Name (including Legal_Services and Transportation_and Public_Utilities), Next Renewal Year (<= 2000 / > 2000 and <= 2002 / > 2002), and Agency Number (<= 430 / > 430). Group_1 and Group_2 also appear as branch labels. Leaves are Yes or No.]

Accuracy of Predicting Customer Interest
Coverage   Accuracy
Type 1     84.0%
Type 2     97.2%
Type 3     98.3%
Type 4     99.5%
Type 5     88.4%
Type 6     100%
Type 7     76.3%
Type 8     85.0%
Type 9     94.8%
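
The slides do not state how these accuracy figures were measured, for example on training data or on held-out data. Assuming the usual definition, the share of instances whose predicted class matches the actual class, the computation is a short sketch like the one below. The sample predictions are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of instances whose predicted class equals the actual class."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical predictions for four customer groups
score = accuracy(["Yes", "No", "No", "Yes"], ["Yes", "No", "Yes", "Yes"])
```
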
Modeling
Mathematical models for predicting if a customer will terminate coverage
Why do customers terminate a specific type of coverage?
What are the important factors in a customer’s decision to terminate coverage?

Who Terminates Type 3 Coverage?
Correct for 95% of customers
[Figure: decision tree with splits on Customer Effective Year, Coverage Effective Year, and Next Renewal Month. Thresholds shown include the years 1999 to 2002 and the value 7. Leaves are Terminated or Active.]

Who Terminates Type 1 Coverage?
Decision tree based on:
  - Distribution number
  - Underwriting department number
  - Price category
  - Rate type
  - Rate Plan Year
Predicts 96.3% of terminations correctly

Accuracy of Predicting Termination
Model    Accuracy
Type 1   96.3%
Type 2   96.5%
Type 3   95.3%
Type 4   88.9%
Type 5   88.3%

Evaluation
Data analysis results in a good model
Are business objectives being achieved?
Is there an important business issue that has not been considered?
Should the results be used?
Deployment
Incorporate the results in the organization’s decision making process
  - Report
  - Decision support system
  - Personalization of web pages
  - Repeatable data mining process