Wharton
Department of Statistics
Profiting from Data Mining
Bob Stine
Department of Statistics
The Wharton School, Univ of Pennsylvania
April 5, 2002
www-stat.wharton.upenn.edu/~bob
Overview
▪ Critical stages of data mining process
- Choosing the right data, people, and problems
- Modeling
- Validation
▪ Automated modeling
- Feature creation and selection
- Exploiting expert knowledge, “insights”
▪ Applications
- Little detail – Biomedical: finding predictive risk factors
- More detail – Financial: predicting returns on the market
- Lots of detail – Credit: anticipating the onset of bankruptcy
Predicting Health Risk
▪ Who is at risk for a disease?
- Example: detect osteoporosis without expense of x-ray
▪ Goals
- Improving public health
- Savings on medical care
- Confirm an informal model with data mining
▪ Many types of features, interested groups
- Clinical observations of doctors
- Laboratory measurements, “genetic”
- Self-reported behavior
▪ Missing data
Predicting the Stock Market
▪ Small, “hands-on” example
▪ Goals
- Better retirement savings?
- Money for that special vacation? College?
- Trade-offs: risk vs return
▪ Lots of “free” data
- Access to accurate historical time trends, macro factors
- Recent data more useful than older data
▪ “Simple” modeling technique
▪ Validation
Predicting the Market: Specifics
▪ Build a regression model
- Response is return on the value-weighted S&P
- Use standard forward/backward stepwise
- Battery of 12 predictors with interactions
▪ Train the model during 1992-1996 (training data)
- Model captures most of variation in 5 years of returns
- Retain only the most significant features (Bonferroni)
▪ Predict returns in 1997 (validation data)
▪ Another version in Foster, Stine & Waterman
Historical patterns?
[Figure: vwReturn (-0.06 to 0.08) versus Year (92 to 98), annotated with “?”.]
Fitted model predicts...
[Figure: predicted return (-0.05 to 0.15) versus Year (92 to 98), annotated “Exceptional Feb return?”.]
What happened?
[Figure: Pred Error (-0.15 to 0.10) versus Year (92 to 98), with the Training Period marked.]
Claimed versus Actual Error
[Figure: Squared Prediction Error (0 to 120) versus Complexity of Model (10 to 100); curves labeled Actual and Claimed.]
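The gap between claimed and actual error can be reproduced with a toy simulation. The sketch below is not the stock-market model from the slides: it assumes a pure-noise response, 50 noise predictors, and a plain greedy forward search (the function name, sample sizes, and seed are all made up for illustration). The claimed (training) error keeps falling as terms are added, while the actual (out-of-sample) error does not follow it down.

```python
import numpy as np


def forward_stepwise_errors(n_train=60, n_test=60, n_features=50,
                            max_terms=20, seed=0):
    """Greedy forward selection on pure noise; returns (claimed, actual) MSE lists."""
    rng = np.random.default_rng(seed)
    X_train = rng.standard_normal((n_train, n_features))
    X_test = rng.standard_normal((n_test, n_features))
    y_train = rng.standard_normal(n_train)  # response unrelated to any predictor
    y_test = rng.standard_normal(n_test)

    def fit(cols):
        # Least-squares fit with an intercept on the chosen columns.
        A = np.column_stack([np.ones(n_train), X_train[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        return beta, float(np.mean((y_train - A @ beta) ** 2))

    chosen, claimed, actual = [], [], []
    for _ in range(max_terms):
        # Greedy step: add whichever unused predictor lowers training error most.
        candidates = [j for j in range(n_features) if j not in chosen]
        best = min(candidates, key=lambda j: fit(chosen + [j])[1])
        chosen.append(best)
        beta, train_mse = fit(chosen)
        B = np.column_stack([np.ones(n_test), X_test[:, chosen]])
        claimed.append(train_mse)
        actual.append(float(np.mean((y_test - B @ beta) ** 2)))
    return claimed, actual


claimed, actual = forward_stepwise_errors()
print(f"claimed MSE: {claimed[0]:.2f} -> {claimed[-1]:.2f}")
print(f"actual MSE:  {actual[0]:.2f} -> {actual[-1]:.2f}")
```

Because every predictor here is noise, any apparent fit is the search capitalizing on chance.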
Over-confidence?
▪ Over-fitting
- Model fits the training data too well, better than it can predict the future.
- Greedy fitting procedure: “Optimization capitalizes on chance”
▪ Some intuition
- Coincidences
• Cancer clusters, the “birthday problem”
- Illustration with an auction
• What is the value of the coins in this jar?
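The “birthday problem” shows how quickly coincidences become likely once many pairs are compared, which is the same mechanism that lets a wide search find spurious features. A minimal sketch, assuming 365 equally likely birthdays (the function name is illustrative):

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


for n in (10, 23, 50):
    print(n, round(prob_shared_birthday(n), 3))
```

With only 23 people the chance of a match already exceeds one half, even though any particular pair matches with probability 1/365.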
Auctions and Over-fitting
What is the value of these coins?
Auctions and Over-fitting
▪ Auction jar of coins to a class of MBA students
▪ Histogram shows the bids of 30 students
▪ Most were suspicious, but a few were not!
▪ Actual value is $3.85
▪ Known as “Winner’s Curse”
▪ Similar to over-fitting: best model like high bidder
[Figure: histogram of the 30 bids; axis ticks 1 to 9.]
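The winner's curse can be illustrated with a short simulation. This is a sketch under assumptions that are not in the slides: each of 30 bidders makes an unbiased guess at the true value of $3.85 with independent normal error (standard deviation 1.0), and the highest guess wins. The classroom bids themselves are not reproduced here.

```python
import numpy as np


def simulate_auction(true_value=3.85, n_bidders=30, noise_sd=1.0,
                     n_auctions=10_000, seed=0):
    """Return (average guess, average winning guess) over many simulated auctions."""
    rng = np.random.default_rng(seed)
    # Assumption: guesses are unbiased, so errors average out across bidders.
    guesses = true_value + noise_sd * rng.standard_normal((n_auctions, n_bidders))
    return float(guesses.mean()), float(guesses.max(axis=1).mean())


avg_guess, avg_winner = simulate_auction()
print(f"average guess:   {avg_guess:.2f}")
print(f"average winner:  {avg_winner:.2f}")
```

The guesses are right on average, yet the winning guess is systematically too high. Picking the best-fitting model out of many candidates overstates its fit in the same way.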
Profiting from data mining?
▪ Where’s the profit in this?
- “Mining the miners” vs getting value from your data
- Lost opportunities
▪ Importance of domain knowledge
▪ Validation as a measure of success
- Prediction provides an explicit check
- Does your application predict something?
Pitfalls and Role of Management
Over-fitting is dominated by other issues…
▪ Management support
- Life in silos
- Coordination across domains
▪ Responsibility and reward
- Accountability
- Who gets the credit when it succeeds? Who suffers if the project is not successful?
Specific Potholes
▪ Moving targets
- “Let’s try this with something else.”
▪ Irrational expectations
- “I could have done better than that.”
▪ Not with my data
- “It’s our data. You can’t use it.”
- “You did not use our data properly.”
Back to a real application…
Emphasis on the statistical issues…
Predicting Bankruptcy
▪ Goal
- Reduce losses stemming from personal bankruptcy
▪ Possible strategies
- If we can identify those with the highest risk of bankruptcy, take some action
• Call them for a “friendly chat” about circumstances
• Unilaterally reduce credit limit
▪ Trade-off
- Good customers borrow lots of money
- Bad customers also borrow lots of money
Predicting Bankruptcy
▪ “Needle in a haystack”
- 3,000,000 months of credit-card activity
- 2244 bankruptcies
- Simple predictor that all are OK looks pretty good.
▪ What factors anticipate bankruptcy?
- Spending patterns? Payment history?
- Demographics? Missing data?
- Combinations of factors?
• Cash Advance + Las Vegas = Problem
▪ We consider more than 100,000 predictors!
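The “needle in a haystack” arithmetic is worth spelling out. The short calculation below uses only the two counts on the slide; the variable names are illustrative.

```python
months = 3_000_000      # account-months of credit-card activity
bankruptcies = 2244     # months that end in bankruptcy

base_rate = bankruptcies / months
accuracy_all_ok = 1 - base_rate   # accuracy of predicting "OK" for everyone

print(f"base rate of bankruptcy: {base_rate:.4%}")
print(f"accuracy of 'all are OK': {accuracy_all_ok:.4%}")
print("bankruptcies found by 'all are OK': 0")
```

Raw accuracy therefore says little here: the trivial rule is right more than 99.9% of the time and still finds none of the cases that matter.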
Modeling: Predictive Models
▪ Build the model
Identify patterns in training data that predict future observations.
- Which features are real? Coincidental?
▪ Evaluate the model
How do you know that it works?
- During the model construction phase
• Only incorporate meaningful features
- After the model is built
• Validate by predicting new observations
Are all prediction errors the same?
▪ Symmetry
- Is over-predicting as costly as under-predicting?
- Managing inventories and sales
- Visible costs versus hidden costs
▪ Does a false positive = a false negative?
- Classification in data mining
- Credit modeling, flagging “risky” customers
- False positive: call a good customer “bad”
- False negative: fail to identify a “bad”
- Differential costs for different types of errors
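Differential costs translate directly into a decision rule. The sketch below uses the standard expected-cost argument: flag a customer when the expected cost of a missed “bad” exceeds the expected cost of a false alarm. The cost figures and function names are hypothetical, chosen only to show the effect; the slides give no costs.

```python
def flag_threshold(cost_false_pos, cost_false_neg):
    """Probability above which flagging has lower expected cost than not flagging."""
    return cost_false_pos / (cost_false_pos + cost_false_neg)


def expected_cost(p_bad, flag, cost_false_pos, cost_false_neg):
    """Expected cost of one decision for a customer with Pr(bad) = p_bad."""
    if flag:
        return (1 - p_bad) * cost_false_pos  # pay only if the customer was good
    return p_bad * cost_false_neg            # pay only if the customer was bad


# Hypothetical costs: a missed bankruptcy is 9 times as costly as a false alarm.
print(flag_threshold(1, 1))   # symmetric costs
print(flag_threshold(1, 9))   # asymmetric costs lower the bar for flagging
```

With symmetric costs the cut-off is one half; making false negatives more expensive pushes the cut-off down, so more customers are flagged.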
Building a Predictive Model
So many choices…
▪ Structure: What type of model?
• Neural net
• CART, classification tree
• Additive model or regression spline
▪ Identification: Which features to use?
• Time lags, “natural” transformations
• Combinations of other features
▪ Search: How does one find these features?
• Brute force has become cheap.
Our Choices
▪ Structure
- Linear regression with nonlinearity via interactions
- All 2-way and some 3-way, 4-way interactions
- Missing data handled with indicators
▪ Identification
- Conservative standard error
- Comparison of conservative t-ratio to adaptive threshold
▪ Search
- Forward stepwise regression
- Coming: Dynamically changing list of features
• Good choice affects where you search next.
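The structure choices above can be sketched in a few lines. This is only an illustration, not the authors' code: it assumes missing values are coded as NaN, fills them with the column mean (an assumption; the slides say only that indicators are used), appends one missing-value indicator per column, and then forms all 2-way products. The function name is hypothetical.

```python
from itertools import combinations

import numpy as np


def expand_features(X):
    """Add missing-value indicators and all 2-way interactions to X."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumption: fill each gap with its column mean; the indicator carries
    # the information that the value was missing.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    base = np.column_stack([filled, missing.astype(float)])
    pairs = [base[:, i] * base[:, j]
             for i, j in combinations(range(base.shape[1]), 2)]
    return np.column_stack([base] + pairs)


X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
print(expand_features(X).shape)
```

Three raw columns already become 6 base columns plus 15 products, which hints at how a modest data set grows into a very large candidate list once higher-order interactions are allowed.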
Identifying Predictive Features
▪ Classical problem of “variable selection”
▪ Thresholding methods (compare t-ratio to threshold)
- Akaike information criterion (AIC)
- Bayes information criterion (BIC)
- Hard thresholding and Bonferroni
▪ Arguments for adaptive thresholds
- Empirical Bayes
- Information theory
- Step-up/step-down tests
Adaptive Thresholding
▪ Threshold changes to conform to attributes of data
- Easier to add features as more are found.
▪ Threshold for first predictor
- Compare conservative t-ratio to Bonferroni.
- Bonferroni is about Sqrt(2 log p)
- If something significant is found, continue.
▪ Threshold for second predictor
- Compare t-ratio to reduced threshold
- New threshold is about Sqrt(2 log(p/2))
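The thresholds above are easy to tabulate. The sketch below computes Sqrt(2 log(p/q)) for the q-th predictor and applies it to a fixed, sorted list of t-ratios. That is a simplification: in stepwise regression the t-ratios are recomputed after each feature enters, and the slides use a conservative standard error that is not modeled here. Function names and the example t-ratios are hypothetical.

```python
import math


def adaptive_threshold(p, q):
    """Approximate threshold for the q-th predictor out of p candidates."""
    if not 1 <= q <= p:
        raise ValueError("need 1 <= q <= p")
    return math.sqrt(2 * math.log(p / q))


def select_adaptive(t_ratios, p):
    """Count how many of the largest |t| values clear their adaptive thresholds."""
    ordered = sorted((abs(t) for t in t_ratios), reverse=True)
    accepted = 0
    for q, t in enumerate(ordered, start=1):
        if q > p or t <= adaptive_threshold(p, q):
            break  # stop at the first t-ratio that fails its threshold
        accepted += 1
    return accepted


p = 100_000
print(round(adaptive_threshold(p, 1), 3))    # Bonferroni-like first threshold
print(round(adaptive_threshold(p, 2), 3))    # slightly lower second threshold
print(select_adaptive([5.2, 4.7, 1.0], p))   # hypothetical t-ratios
```

In this example the second t-ratio (4.7) would fail the fixed first threshold but passes the reduced one, which is the sense in which features get easier to add as more are found.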
Adaptive Thresholding: Benefits
▪ Easy
As easy and fast as implementing the standard criterion that is used in stepwise regression.
▪ Theory
Resulting model provably as good as best Bayes model for the problem at hand.
▪ Real world
It works! Finds models with real signal, and stops when the signal runs out.
Bankruptcy Model: Construction
▪ Data: reserve 80% for validation
- Training data
• 600,000 months
• 458 bankruptcies
- Validation data
• 2,400,000 months
• 1786 bankruptcies
▪ Selection via adaptive thresholding
- Compare sequence of t-statistics to Sqrt(2 log(p/q))
- Dynamic expansion of feature space
Bankruptcy Model: Preview
▪ Predictors
- Initial search identifies 39
• Validation SS monotonically falls to 1650
• Linear fit can do no better than 1735
- Expanded search of higher interactions finds a bit more
• Nature of predictors comprising the interactions
• Validation SS drops 10 more
▪ Validation: Lift chart
- Top 1000 candidates have 351 bankrupt
▪ More validation: Calibration
- Close to actual Pr(bankrupt) for most groups.
Bankruptcy Model: Fitting
▪ Where should the fitting process be stopped?
[Figure: Residual Sum of Squares. SS (400 to 470) versus Number of Predictors (0 to 150).]
Bankruptcy Model: Fitting
▪ Our adaptive selection procedure stops at a model with 39 predictors.
[Figure: Residual Sum of Squares. SS (400 to 470) versus Number of Predictors (0 to 150).]
Bankruptcy Model: Validation
▪ The validation indicates that the fit gets better while the model expands. Avoids over-fitting.
[Figure: Validation Sum of Squares. SS (1640 to 1760) versus Number of Predictors (0 to 150).]
Bankruptcy Model: Linear?
▪ Choosing from linear predictors (no interactions) does not match the performance of the full search.
[Figure: Validation Sum of Squares. SS (1640 to 1760) versus Number of Predictors (0 to 150); curves labeled Linear and Quadratic.]
Bankruptcy Model: More?
▪ Searching higher-order interactions offers modest improvement.
[Figure: Validation Sum of Squares. SS (1640 to 1680) versus Number of Predictors (0 to 60); curves labeled Quadratic and Cubic.]
Lift Chart
▪ Measures how well model classifies sought-for group
Lift = (% bankrupt in DM selection) / (% bankrupt in all data)
▪ Depends on rule used to label customers
- Very high threshold
Lots of lift, but few bankrupt customers are found.
- Lower threshold
Lift drops, but finds more bankrupt customers.
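The lift ratio can be worked through with counts quoted elsewhere in this deck: 351 bankruptcies among the top 1000 candidates, against 1786 bankruptcies in 2,400,000 validation months. The sketch assumes the top-1000 figure refers to the validation data; the function name is illustrative.

```python
def lift(hits_selected, n_selected, hits_total, n_total):
    """Rate of the sought-for group in the selection, relative to the overall rate."""
    return (hits_selected / n_selected) / (hits_total / n_total)


# Assumption: the top 1000 candidates are drawn from the validation data.
top_1000_lift = lift(351, 1000, 1786, 2_400_000)
share_found = 351 / 1786

print(f"lift in top 1000: {top_1000_lift:.0f}")
print(f"share of all bankruptcies found: {share_found:.1%}")
```

The two printed numbers show the trade-off named above: a very selective rule gives enormous lift but reaches only a fraction of the bankrupt customers.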
Generic Lift Chart
[Figure: % Responders (0.0 to 1.0) versus % Chosen (0 to 100); curves labeled Model and Random.]
Bankruptcy Model: Lift
▪ Much better than diagonal!
[Figure: % Found (0 to 100) versus % Contacted (0 to 100).]
Calibration
▪ Classifier assigns Prob(“BR”) rating to a customer.
▪ Weather forecast
▪ Among those classified as 2/10 chance of “BR”, how many are BR?
▪ Closer to diagonal is better.
[Figure: calibration plot. Actual (0 to 100) versus claimed chance (10 to 90).]
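A calibration check of this kind amounts to grouping cases by claimed probability and comparing each group's average claim with its observed rate. The sketch below runs on simulated data that is calibrated by construction; it is not the bankruptcy data, and the function name, bin count, and seed are assumptions made for illustration.

```python
import numpy as np


def calibration_table(claimed, actual, n_bins=10):
    """Rows of (mean claimed probability, observed rate, count) per probability bin."""
    claimed = np.asarray(claimed, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each claim to a bin index; a claim of exactly 1.0 goes in the last bin.
    bins = np.clip(np.digitize(claimed, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((float(claimed[mask].mean()),
                         float(actual[mask].mean()),
                         int(mask.sum())))
    return rows


rng = np.random.default_rng(0)
p = rng.random(100_000)              # claimed probabilities
y = rng.random(100_000) < p          # outcomes drawn to match the claims
for mean_claim, observed, count in calibration_table(p, y):
    print(f"claimed {mean_claim:.2f}  actual {observed:.2f}  n={count}")
```

For a well-calibrated classifier the two columns track each other, which is the “closer to diagonal” criterion; a model that over-predicts risk would show actual rates below the claims in the upper bins.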
Bankruptcy Model: Calibration
▪ Over-predicts risk above claimed probability 0.4
[Figure: Calibration Chart. Actual (0 to 1.2) versus Claim (0 to 0.8).]
Summary of Bankruptcy Model
▪ Automatic, adaptive selection
- Finds patterns that predict new observations
- Predictive, but not easy to explain
▪ Dynamic feature set
- Current research
- Information theory allows changing search space
- Finds more structure than direct search could find
▪ Validation
- Essential only for judging fit.
- Better than “hand-made models” that take years to create.
So, where’s the profit in DM?
▪ Automated modeling has become very powerful, avoiding problems of over-fitting.
▪ Role for expert judgment remains
- What data to use?
- Which features to try first?
- What are the economics of the prediction errors?
▪ Collaboration
- Data sources
- Data analysis
- Strategic decisions