Profiting from Data Mining
Bob Stine
Department of Statistics, The Wharton School, Univ of Pennsylvania
April 5, 2002
www-stat.wharton.upenn.edu/~bob

Overview
Critical stages of the data mining process
- Choosing the right data, people, and problems
- Modeling
- Validation
Automated modeling
- Feature creation and selection
- Exploiting expert knowledge, "insights"
Applications
- Little detail: Biomedical, finding predictive risk factors
- More detail: Financial, predicting returns on the market
- Lots of detail: Credit, anticipating the onset of bankruptcy

Predicting Health Risk
Who is at risk for a disease?
- Example: detect osteoporosis without the expense of an x-ray
Goals
- Improving public health
- Savings on medical care
- Confirm an informal model with data mining
Many types of features, interested groups
- Clinical observations of doctors
- Laboratory measurements, "genetic"
- Self-reported behavior
Missing data

Predicting the Stock Market
Small, "hands-on" example
Goals
- Better retirement savings?
- Money for that special vacation? College?
- Trade-offs: risk vs return
Lots of "free" data
- Access to accurate historical time trends, macro factors
- Recent data more useful than older data
"Simple" modeling technique
Validation

Predicting the Market: Specifics
Build a regression model
- Response is the return on the value-weighted S&P
- Use standard forward/backward stepwise selection
- Battery of 12 predictors with interactions
Train the model during 1992-1996 (training data)
- Model captures most of the variation in 5 years of returns
- Retain only the most significant features (Bonferroni)
Predict returns in 1997 (validation data)
Another version appears in Foster, Stine & Waterman

Historical patterns?
[Figure: monthly value-weighted returns (vwReturn), 1992-1998]

Fitted model predicts...
[Figure: fitted returns, 1992-1998; an exceptional February return?]

What happened?
[Figure: prediction errors, 1992-1998, with the training period marked]

Claimed versus Actual Error
[Figure: actual squared prediction error vs claimed error, by model complexity (10-100 predictors)]

Over-confidence?
Over-fitting
- Model fits the training data too well, better than it can predict the future
- Greedy fitting procedure: "optimization capitalizes on chance"
Some intuition (see the simulation sketch at the end of this part)
- Coincidences
• Cancer clusters, the "birthday problem"
- Illustration with an auction
• What is the value of the coins in this jar?

Auctions and Over-fitting
What is the value of these coins?
[Photo: jar of coins]

Auctions and Over-fitting
Auction a jar of coins to a class of MBA students
- Histogram shows the bids of 30 students
- Most were suspicious, but a few were not!
- Actual value is $3.85
Known as the "Winner's Curse"
Similar to over-fitting: the best-fitting model is like the high bidder
[Figure: histogram of the 30 bids]

Profiting from data mining?
Where's the profit in this?
- "Mining the miners" vs getting value from your data
- Lost opportunities
Importance of domain knowledge
Validation as a measure of success
- Prediction provides an explicit check
- Does your application predict something?
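The over-fitting story above is easy to reproduce. Below is a minimal sketch (my illustration, not the talk's code) that mimics the market exercise with pure noise: greedy forward selection over many candidate predictors makes the training fit look better and better, while out-of-sample error does not improve. All sizes and the selection-by-correlation shortcut are illustrative assumptions.

```python
# Greedy forward selection on pure noise: "optimization capitalizes on chance".
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 60, 12, 120   # ~5 years vs 1 year of months; many candidates
X = rng.normal(size=(n_train + n_test, p))
y = rng.normal(size=n_train + n_test)       # the "returns" are pure noise
Xtr, Xte = X[:n_train], X[n_train:]
ytr, yte = y[:n_train], y[n_train:]

chosen = []
resid = ytr - ytr.mean()
for _ in range(10):
    # pick the candidate with the largest absolute inner product with residuals
    score = np.abs(Xtr.T @ resid)
    score[chosen] = -np.inf                 # never re-pick a chosen column
    chosen.append(int(np.argmax(score)))
    # refit least squares on the chosen columns
    beta, *_ = np.linalg.lstsq(Xtr[:, chosen], ytr, rcond=None)
    resid = ytr - Xtr[:, chosen] @ beta
    train_mse = float(resid @ resid) / n_train
    test_err = yte - Xte[:, chosen] @ beta
    test_mse = float(test_err @ test_err) / n_test
    print(f"{len(chosen):2d} predictors: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training MSE falls monotonically even though there is nothing to find, which is exactly the gap between the claimed and actual error in the chart above.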
Pitfalls and Role of Management
Over-fitting is dominated by other issues...
Management support
- Life in silos
- Coordination across domains
Responsibility and reward
- Accountability
- Who gets the credit when it succeeds? Who suffers if the project is not successful?

Specific Potholes
Moving targets
- "Let's try this with something else."
Irrational expectations
- "I could have done better than that."
Not with my data
- "It's our data. You can't use it."
- "You did not use our data properly."

Back to a real application...
Emphasis on the statistical issues...

Predicting Bankruptcy
Goal
- Reduce losses stemming from personal bankruptcy
Possible strategies
- If we can identify those with the highest risk of bankruptcy... take some action
• Call them for a "friendly chat" about circumstances
• Unilaterally reduce the credit limit
Trade-off
- Good customers borrow lots of money
- Bad customers also borrow lots of money

Predicting Bankruptcy
"Needle in a haystack"
- 3,000,000 months of credit-card activity
- 2,244 bankruptcies
- A simple predictor that all are OK looks pretty good.
What factors anticipate bankruptcy?
- Spending patterns? Payment history?
- Demographics? Missing data?
- Combinations of factors?
• Cash Advance + Las Vegas = Problem
We consider more than 100,000 predictors!

Modeling: Predictive Models
Build the model
- Identify patterns in training data that predict future observations.
- Which features are real? Which are coincidental?
Evaluate the model: how do you know that it works?
- During the model construction phase
• Only incorporate meaningful features
- After the model is built
• Validate by predicting new observations

Are all prediction errors the same?
Symmetry
- Is over-predicting as costly as under-predicting?
- Managing inventories and sales
- Visible costs versus hidden costs
Does a false positive = a false negative?
- Classification in data mining
- Credit modeling, flagging "risky" customers
- False positive: call a good customer "bad"
- False negative: fail to identify a "bad"
- Differential costs for different types of errors

Building a Predictive Model
So many choices...
Structure: What type of model?
• Neural net
• CART, classification tree
• Additive model or regression spline
Identification: Which features to use?
• Time lags, "natural" transformations
• Combinations of other features
Search: How does one find these features?
• Brute force has become cheap.

Our Choices
Structure
- Linear regression with nonlinearity via interactions
- All 2-way and some 3-way, 4-way interactions
- Missing data handled with indicators
Identification
- Conservative standard error
- Comparison of a conservative t-ratio to an adaptive threshold
Search
- Forward stepwise regression
- Coming: dynamically changing list of features
• A good choice affects where you search next.
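To make the "Structure" choices above concrete, here is a sketch of how a candidate feature set might be built: missing values become indicator columns (with the original value filled in), and pairwise interactions expand the base features. The column names are hypothetical; this illustrates the idea, not the talk's actual code.

```python
# Expand a small base table into the kind of large candidate set the talk
# searches: missing-data indicators plus all two-way interactions.
import numpy as np
import pandas as pd
from itertools import combinations

def expand_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    # missing data handled with indicators: keep a flag, then fill the value
    for col in df.columns:
        if df[col].isna().any():
            out[f"{col}_missing"] = df[col].isna().astype(float)
        out[col] = df[col].fillna(df[col].mean())
    # all two-way interactions of the base features
    for a, b in combinations(df.columns, 2):
        out[f"{a}*{b}"] = out[a] * out[b]
    return out

# Hypothetical credit-card features, one row per customer-month.
base = pd.DataFrame({
    "cash_advance": [0.0, 250.0, np.nan, 40.0],
    "util_rate":    [0.1, 0.9, 0.5, np.nan],
    "late_pays":    [0, 3, 1, 0],
})
print(expand_features(base).columns.tolist())
```

With a few hundred base features, indicators plus 2-way (and selected higher-order) interactions easily reach the 100,000+ candidate predictors mentioned above.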
Identifying Predictive Features
Classical problem of "variable selection"
Thresholding methods (compare t-ratio to a threshold)
- Akaike information criterion (AIC)
- Bayes information criterion (BIC)
- Hard thresholding and Bonferroni
Arguments for adaptive thresholds
- Empirical Bayes
- Information theory
- Step-up/step-down tests

Adaptive Thresholding
Threshold changes to conform to attributes of the data
- Easier to add features as more are found.
Threshold for the first predictor
- Compare a conservative t-ratio to the Bonferroni threshold, about sqrt(2 log p).
- If something significant is found, continue.
Threshold for the second predictor
- Compare the t-ratio to a reduced threshold, about sqrt(2 log(p/2)).
(A code sketch of this rule appears at the end of this part.)

Adaptive Thresholding: Benefits
Easy
- As easy and fast to implement as the standard criterion used in stepwise regression.
Theory
- Resulting model is provably as good as the best Bayes model for the problem at hand.
Real world
- It works! Finds models with real signal, and stops when the signal runs out.

Bankruptcy Model: Construction
Data: reserve 80% for validation
- Training data
• 600,000 months
• 458 bankruptcies
- Validation data
• 2,400,000 months
• 1,786 bankruptcies
Selection via adaptive thresholding
- Compare the sequence of t-statistics to sqrt(2 log(p/q))
- Dynamic expansion of the feature space

Bankruptcy Model: Preview
Predictors
- Initial search identifies 39
• Validation SS falls monotonically to 1650
• A linear fit can do no better than 1735
- Expanded search of higher interactions finds a bit more
• Nature of the predictors comprising the interactions
• Validation SS drops 10 more
Validation: lift chart
- Top 1000 candidates include 351 bankruptcies
More validation: calibration
- Close to the actual Pr(bankrupt) for most groups

Bankruptcy Model: Fitting
Where should the fitting process be stopped?
[Figure: residual sum of squares falls steadily with the number of predictors (0-150)]

Bankruptcy Model: Fitting
Our adaptive selection procedure stops at a model with 39 predictors.
[Figure: same residual-SS curve with the stopping point marked]

Bankruptcy Model: Validation
The validation indicates that the fit gets better while the model expands. Avoids over-fitting.
[Figure: validation sum of squares vs number of predictors]

Bankruptcy Model: Linear?
Choosing from linear predictors (no interactions) does not match the performance of the full search.
[Figure: validation sum of squares, linear vs quadratic searches]

Bankruptcy Model: More?
Searching higher-order interactions offers a modest improvement.
[Figure: validation sum of squares, quadratic vs cubic searches]

Lift Chart
Measures how well the model classifies the sought-for group:

    Lift = (% bankrupt in DM selection) / (% bankrupt in all data)

Depends on the rule used to label customers
- Very high threshold: lots of lift, but few bankrupt customers are found.
- Lower threshold: lift drops, but finds more bankrupt customers.
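As promised above, the adaptive thresholding rule reduces to a simple schedule: the q-th feature to enter must have a t-ratio above roughly sqrt(2 log(p/q)), so the bar drops as signal is found. A minimal sketch of that schedule (my illustration, not the talk's implementation; the t-ratios fed in are assumed to come from a stepwise fit):

```python
import math

def adaptive_threshold(p: int, q: int) -> float:
    """Approximate entry threshold for the q-th predictor among p candidates."""
    return math.sqrt(2.0 * math.log(p / q))

def select(t_ratios: list[float], p: int) -> list[int]:
    """Greedy pass: t_ratios[i] is the conservative t-ratio of the best
    remaining candidate at step i+1; stop at the first failure."""
    chosen = []
    for q, t in enumerate(t_ratios, start=1):
        if abs(t) < adaptive_threshold(p, q):
            break                      # signal has run out; stop the search
        chosen.append(q)
    return chosen

p = 100_000                            # candidate features, as in the talk
print(adaptive_threshold(p, 1))        # Bonferroni-like bar: ~4.8
print(adaptive_threshold(p, 40))       # bar after 39 successes: ~4.0
print(select([5.1, 4.9, 4.6, 3.2], p))  # -> [1, 2, 3]
```

With p = 100,000 candidates the first feature needs a t-ratio near 4.8, while the 40th needs only about 4.0, which is why it becomes "easier to add features as more are found."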
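The lift definition just given translates directly into code: score every customer-month, keep the top k, and divide the bankruptcy rate in that slice by the base rate. For scale, the 351 bankruptcies among the top 1000 reported above, against a base rate of 2,244/3,000,000, correspond to a lift of roughly 0.351/0.000748, or about 470. The sketch below uses simulated data, not the talk's.

```python
import numpy as np

def lift_at(scores: np.ndarray, bankrupt: np.ndarray, k: int) -> float:
    """Lift = (% bankrupt among the top-k scored) / (% bankrupt overall)."""
    top = np.argsort(scores)[::-1][:k]     # k highest-risk customers
    return float(bankrupt[top].mean() / bankrupt.mean())

# Toy data: rare bankruptcies and a score that carries some real signal.
rng = np.random.default_rng(1)
bankrupt = (rng.random(100_000) < 0.001).astype(float)
scores = 2.0 * bankrupt + rng.normal(size=bankrupt.size)
print(lift_at(scores, bankrupt, k=1000))
```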
Generic Lift Chart
[Figure: % responders vs % chosen; the model's curve lies above the random (diagonal) baseline]

Bankruptcy Model: Lift
Much better than the diagonal!
[Figure: % of bankruptcies found vs % of customers contacted]

Calibration
Classifier assigns a Prob("BR") rating to each customer.
Like a weather forecast: among those classified as having a 2/10 chance of "BR", how many are BR?
Closer to the diagonal is better.
[Figure: actual rate vs claimed probability]

Bankruptcy Model: Calibration
Over-predicts risk above a claimed probability of 0.4.
(A sketch of this calibration check appears after the summary.)
[Figure: calibration chart, actual vs claimed probability from 0 to 0.8]

Summary of Bankruptcy Model
Automatic, adaptive selection
- Finds patterns that predict new observations
- Predictive, but not easy to explain
Dynamic feature set
- Current research
- Information theory allows changing the search space
- Finds more structure than a direct search could find
Validation
- Essential, but used only for judging fit
- Better than "hand-made" models that take years to create

So, where's the profit in DM?
Automated modeling has become very powerful, avoiding the problems of over-fitting.
A role for expert judgment remains
- What data to use?
- Which features to try first?
- What are the economics of the prediction errors?
Collaboration
- Data sources
- Data analysis
- Strategic decisions
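As promised above, a sketch of the calibration check: bin customers by their claimed Pr("BR"), then compare each bin's average claimed probability with the observed bankruptcy rate. The data here are simulated to be perfectly calibrated; on the talk's model, the bins above 0.4 would sit off the diagonal. (An illustration under assumed data, not the talk's code.)

```python
import numpy as np

def calibration_table(claimed: np.ndarray, actual: np.ndarray, n_bins: int = 10):
    """Rows of (bin lo, bin hi, mean claimed prob, observed rate)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (claimed >= lo) & (claimed < hi)
        if in_bin.any():
            rows.append((lo, hi, claimed[in_bin].mean(), actual[in_bin].mean()))
    return rows

# Toy data drawn so that outcomes match the claimed probabilities exactly.
rng = np.random.default_rng(2)
claimed = rng.uniform(0.0, 1.0, 20_000)
actual = (rng.random(claimed.size) < claimed).astype(float)
for lo, hi, c, a in calibration_table(claimed, actual):
    print(f"[{lo:.1f}, {hi:.1f}): claimed {c:.2f}, observed {a:.2f}")
```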