Data Mining Research
David L. Olson
University of Nebraska

Data Mining Research
• Business Applications
  – Credit scoring
  – Customer classification
  – Fraud detection
  – Human resource management
• Algorithms
• Database related
  – Data warehouse products claim internal data mining
• Text mining
• Data Mining Process

Personal (with others)
• Business Applications
  – Introduction to Business Data Mining with Yong Shi [2006]
  – Qing Cao - RFM
• Algorithms
  – Advanced Data Mining Techniques with Dursun Delen [2008]
  – Moshkovich & Mechitov
  – Ordinal scales in trees
  – Data set balancing
• Database related
  – encyclopedia
• Text mining
  – Web log ethics
• Data Mining Process
  – Ton Stam, Dursun Delen

RFM
with Qing Cao, Ching Gu, Donhee Lee
• Recency
  – Time since customer made last purchase
• Frequency
  – Number of purchases this customer made over time frame
• Monetary
  – Average purchase amount (or total)

Variants
• F & M highly correlated
  – Bult & Wansbeek [1995] Journal of Marketing Science
• Value = M/R
  – Yang (2004) Journal of Targeting, Measurement and Analysis for Marketing

Limitations
• Other attributes may be important
  – Product variation
  – Customer age
  – Customer income
  – Customer lifestyle
• Still, RFM widely used
  – Works well if response rate is high

Data
• Nebraska food products firm (meat retailer)
• 64,180 individual purchase orders (by mail)
• 10,000 individual customers
• 11 Oct 1998 to 3 Oct 2003
• Data:
  – Order date
  – Order amount (price)
  – Whether or not promotion involved

Treatment
• Used 5,000 observations to build model
  – To the end of 2002
• Used another 5,000 for testing
  – 2003

Correlations
* - 0.01 significance; ** - 0.05 significance; *** - 0.001 significance

      R      F        M        Response 2003   Response 2003 $
R     1.0    -0.371   -0.278   0.209 ***       -0.100 ***
F            1.0      0.749    0.135 ***       0.534 ***
M                     1.0      0.133 ***       0.751 ***

Data
Factor   Min   Max     Group1   Group2      Group3      Group4       Group5
R        1     1542    1233+    925-1232    617-924     309-616      1-308
Count                  548      209         297         464          3482
F        1     56      1-11     12-22       23-33       34-44        45+
Count                  4503     431         49          13           4
M        20    15199   0-3040   3041-6060   6061-9080   9081-12100   12100+
Count                  4900     81          17          1            1

Count by RFM Cell

RF   R            F         M1     M2   M3   M4   M5
55   R 1-308      F 45+     1      2    1    0    0
54                F 34-44   3      6    4    0    0
53                F 23-33   22     23   4    0    0
52                F 12-22   355    36   5    1    1
51                F 1-11    3003   12   3    0    0
45   R 309-616    F 45+     0      0    0    0    0
44                F 34-44   0      0    0    0    0
43                F 23-33   0      0    0    0    0
42                F 12-22   29     1    0    0    0
41                F 1-11    433    1    0    0    0
35   R 617-924    F 45+     0      0    0    0    0
34                F 34-44   0      0    0    0    0
33                F 23-33   0      0    0    0    0
32                F 12-22   3      0    0    0    0
31                F 1-11    294    0    0    0    0
25   R 925-1232   F 45+     0      0    0    0    0
24                F 34-44   0      0    0    0    0
23                F 23-33   0      0    0    0    0
22                F 12-22   0      0    0    0    0
21                F 1-11    209    0    0    0    0
15   R 1233+      F 45+     0      0    0    0    0
14                F 34-44   0      0    0    0    0
13                F 23-33   0      0    0    0    0
12                F 12-22   0      0    0    0    0
11                F 1-11    548    0    0    0    0

Basic Model Coincidence Matrix
Correct 0.6076

           Model 0   Model 1   Totals
Actual 0   872       13        885
Actual 1   1949      2166      4115
Totals     2821      2179      5000

BALANCE CELLS
• Adjusted boundaries of 5 x 5 x 5 matrix
• Can’t get all to equal average of 8
  – Lumpy (due to ties)
  – Ranged from 4 to 11

Balanced Cell Densities
Correct 0.8380

RF   R           F       M1   M2   M3   M4   M5
55   R 1-22      F 9+    43   41   41   41   42
54               F 6-8   43   44   43   43   44
53               F 4-5   57   64   63   61   63
52               F=3     30   31   35   34   34
51               F 1-2   26   25   25   21   24
45   R 23-48     F 9+    41   41   41   41   41
44               F 6-8   34   38   36   38   37
43               F 4-5   58   56   57   57   57
42               F=3     40   36   37   38   37
41               F 1-2   19   20   18   22   20
35   R 49-151    F 9+    63   64   62   63   63
34               F 6-8   49   50   50   50   50
33               F 4-5   43   43   44   45   44
32               F=3     29   21   28   27   28
31               F 1-2   19   23   19   18   21
25   R 152-672   F 9+    38   38   38   38   38
24               F 6-8   50   49   51   50   50
23               F 4-5   51   51   51   51   51
22               F=3     32   33   31   32   33
21               F 1-2   29   25   33   26   30
15   R 673+      F 9+    16   15   15   15   15
14               F 6-8   15   16   15   16   15
13               F 4-5   30   30   30   30   30
12               F=3     59   70   63   64   64
11               F 1-2   67   73   93   76   69

Alternatives
• LIFT
  – Sort groups by best response
  – Apply your marketing budget to the most profitable (until you run out of budget)
  – LIFT is the gain obtained above par
    (random)
• VALUE FUNCTION (Yang, 2004)
  – Throw out F (correlated with M)
  – Use ratio of M/R
• Logistic Regression
• Decision Tree
• Neural Network

[Figure: LIFT Equal Groups]

V Value by Cell

Cell     Min V    n      0     1      %       Avg$
1        0        424    3     421    0.993   94.42
2        0.0464   286    2     284    0.993   101.26
3        0.108    281    1     280    0.996   107.18
4        0.195    352    0     352    1.000   108.25
5        0.376    303    2     301    0.993   119.99
6        0.72     285    9     276    0.968   136.13
7        1.252    292    57    235    0.805   127.05
8        1.952    319    120   199    0.624   98.31
9        2.73     293    101   192    0.655   101.01
10       3.74     229    102   127    0.555   101.69
11       4.972    231    87    144    0.623   102.33
12       6.524    254    86    168    0.661   107.12
13       8.6      218    75    143    0.656   119.83
14       11.08    216    71    145    0.671   119.99
15       14.34    191    49    142    0.743   122.55
16       18.35    207    46    161    0.778   157.82
17       24.15    166    30    136    0.819   159.17
18       32.87    148    17    131    0.885   220.18
19       48.2     175    16    159    0.909   284.74
20       92       130    11    119    0.915   424.69
Totals            5000   885   4115   0.823   131.33

[Figure: V Model Lift]

Models
• Regression: -0.4775 + 0.00853 R + 0.1675 F + 0.00213 M
  Test data: Correct 0.8230
• Decision Tree
  IF R ≤ 82
     AND R ≤ 32                  YES (1567 right, 198 wrong)
     ELSE R > 32
        AND F ≤ 3
           AND M ≤ 296           NO (285 right, 91 wrong)
           ELSE M > 296          YES (28 right, 9 wrong)
        ELSE F > 3               YES (729 right, 110 wrong)
  ELSE R > 82                    YES (2391 right, 3 wrong)
  Test data: Correct 0.8678
• Neural Network
  Test data: Correct 0.8674

COMPARISONS

Model                     Test Accuracy   Benefits               Drawbacks
RFM                       0.6076          Simplest data          Uneven cell densities
Degenerate (all 1)        0.8230
Balanced cell sizes       0.7156          Better statistically   More data manipulation
Balanced cell sizes $     0.8380          Better statistically
Value function            0.8180          Condense to one IV     Less information
Logistic regression       0.8230          Additional IVs         Formula hard to apply (degenerate)
Decision tree             0.8678          Easy to interpret
Neural network            0.8674          Fit nonlinear data     Hard to apply model
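
Sketch: applying the models in code

The decision tree rules, the value function V = M/R, and the "Correct" figure from the coincidence matrix can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation. The function names (tree_predict, value_score, accuracy) are hypothetical, and the nesting of the tree is one reading of the rule list on the Models slide. The matrix counts are the four cells of the Basic Model Coincidence Matrix.

```python
def tree_predict(r: float, f: float, m: float) -> int:
    """Return 1 (YES) or 0 (NO) using the thresholds from the Models slide.

    Assumption: this nesting is one reading of the slide's rule list.
    """
    if r > 82:
        return 1            # ELSE R > 82 -> YES
    if r <= 32:
        return 1            # R <= 82 AND R <= 32 -> YES
    # Here 32 < R <= 82
    if f > 3:
        return 1            # F > 3 -> YES
    # Here F <= 3: the only NO leaf is M <= 296
    return 0 if m <= 296 else 1


def value_score(m: float, r: float) -> float:
    """Value function V = M / R (Yang, 2004, as cited on the slides)."""
    return m / r


def accuracy(matrix: list[list[int]]) -> float:
    """Share of correct predictions; matrix[actual][model] holds counts."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Basic RFM model: rows are Actual 0 / Actual 1, columns are Model 0 / Model 1.
basic_rfm = [[872, 13], [1949, 2166]]
print(round(accuracy(basic_rfm), 4))   # 0.6076

# One customer with R = 50, F = 2, M = 100 falls in the single NO leaf.
print(tree_predict(50, 2, 100))        # 0
```

Writing the tree as plain conditionals mirrors the "Easy to interpret" benefit listed in the comparison table: each leaf can be read straight off the code.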