Data Mining Research David L. Olson University of Nebraska

Data Mining Research
• Business Applications
– Credit scoring
– Customer classification
– Fraud detection
– Human resource management
• Algorithms
• Database related
– Data warehouse products claim internal data mining
• Text mining
• Data Mining Process
Personal (with others)
• Business Applications
– Introduction to Business Data Mining with Yong Shi [2006]
– Qing Cao - RFM
• Algorithms
– Advanced Data Mining Techniques with Dursun Delen [2008]
– Moshkovich & Mechitov – Ordinal scales in trees
– Data set balancing
• Database related
– encyclopedia
• Text mining
– Web log ethics
• Data Mining Process
– Ton Stam, Dursun Delen
RFM
with Qing Cao, Ching Gu, Donhee Lee
• Recency
– Time since customer made last purchase
• Frequency
– Number of purchases this customer made over the time frame
• Monetary
– Average purchase amount (or total)
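A minimal sketch of how the three measures could be computed from raw order records. The tuple layout (customer, order date, amount), the function name rfm and the sample orders are assumptions made for illustration; they are not taken from the study. Monetary is computed here as the average purchase amount.

```python
from datetime import date

def rfm(orders, as_of):
    """Return {customer: (recency_days, frequency, monetary_avg)}.

    orders: iterable of (customer_id, order_date, amount) tuples (hypothetical layout).
    as_of:  the date from which recency is measured.
    """
    last, count, total = {}, {}, {}
    for cust, when, amount in orders:
        last[cust] = max(last.get(cust, when), when)   # most recent order date
        count[cust] = count.get(cust, 0) + 1           # number of orders
        total[cust] = total.get(cust, 0.0) + amount    # total spend
    return {
        c: ((as_of - last[c]).days, count[c], total[c] / count[c])
        for c in last
    }

# Made-up orders, for illustration only
orders = [
    ("A", date(2002, 11, 1), 50.0),
    ("A", date(2002, 12, 1), 70.0),
    ("B", date(2001, 6, 15), 120.0),
]
scores = rfm(orders, as_of=date(2002, 12, 31))
print(scores["A"])  # (30, 2, 60.0)
```

Replacing total[c] / count[c] with total[c] gives the "total" variant of Monetary.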
Variants
• F & M highly correlated
– Bult & Wansbeek [1995] Marketing Science
• Value = M/R
– Yang (2004) Journal of Targeting, Measurement and Analysis for Marketing
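The value function can be sketched in a few lines. The function name value_score and the two sample customers are hypothetical; the sketch only assumes that recency is a positive number of days.

```python
def value_score(monetary, recency_days):
    """Value function V = M / R; R must be positive (days since last purchase)."""
    if recency_days <= 0:
        raise ValueError("recency must be positive")
    return monetary / recency_days

# Made-up (M, R) pairs, for illustration only
customers = {"A": (60.0, 30), "B": (120.0, 564)}
ranked = sorted(customers, key=lambda c: value_score(*customers[c]), reverse=True)
print(ranked)  # ['A', 'B']
```

Customer A spends less per order than B but bought far more recently, so A ranks higher on V.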
Limitations
• Other attributes may be important
– Product variation
– Customer age
– Customer income
– Customer lifestyle
• Still, RFM widely used
– Works well if response rate is high
Data
• Nebraska food products firm (meat retailer)
• 64,180 individual purchase orders (by mail)
• 10,000 individual customers
• 11 Oct 1998 to 3 Oct 2003
• Data:
– Order date
– Order amount (price)
– Whether or not promotion involved
Treatment
• Used 5,000 observations to build model
– To the end of 2002
• Used another 5,000 for testing
– 2003
Correlations
* - 0.01 significance; ** - 0.05 significance; *** - 0.001 significance

       R      F        M        Response 2003   Response 2003 $
R      1.0    -0.371   -0.278   0.209 ***       -0.100 ***
F             1.0      0.749    0.135 ***       0.534 ***
M                      1.0      0.133 ***       0.751 ***
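For reference, a correlation of this kind can be computed as a plain Pearson coefficient. The sketch below assumes that is the statistic reported; the function name pearson and the sample F and M values are invented for illustration and are not the study data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up F and M values, for illustration only
frequency = [1, 2, 3, 5, 8]
monetary = [40.0, 55.0, 90.0, 120.0, 210.0]
r_fm = pearson(frequency, monetary)
```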
Data
Factor   Min   Max     Group1    Group2      Group3      Group4       Group5
R        1     1542    1233+     925-1232    617-924     309-616      1-308
Count                  548       209         297         464          3482
F        1     56      1-11      12-22       23-33       34-44        45+
Count                  4503      431         49          13           4
M        20    15199   0-3040    3041-6060   6061-9080   9081-12100   12100+
Count                  4900      81          17          1            1
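A small sketch of how a customer could be assigned to an RFM cell using the equal-width group boundaries listed above. The function name rfm_cell is hypothetical, and the start of the top M group is assumed to be 12101, since the slide shows both "9081-12100" and "12100+".

```python
from bisect import bisect_right

# Lower bounds of groups 2..5, taken from the group boundaries above
R_BOUNDS = [309, 617, 925, 1233]       # recency in days; larger R means a lower group
F_BOUNDS = [12, 23, 34, 45]
M_BOUNDS = [3041, 6061, 9081, 12101]   # 12101 is an assumption (slide says "12100 +")

def rfm_cell(r, f, m):
    """Return (R group, F group, M group), each from 1 to 5."""
    r_group = 5 - bisect_right(R_BOUNDS, r)   # R 1-308 -> 5, R 1233+ -> 1
    f_group = 1 + bisect_right(F_BOUNDS, f)   # F 1-11 -> 1, F 45+ -> 5
    m_group = 1 + bisect_right(M_BOUNDS, m)   # M 0-3040 -> 1
    return r_group, f_group, m_group

print(rfm_cell(100, 5, 100))  # (5, 1, 1)
```

With these boundaries almost every customer falls into F group 1 and M group 1, which is the uneven cell density the counts show.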
Count by RFM Cell
RF   R            F         M1     M2   M3   M4   M5
55   R 1-308      F 45+     1      2    1    0    0
54                F 34-44   3      6    4    0    0
53                F 23-33   22     23   4    0    0
52                F 12-22   355    36   5    0    1
51                F 1-11    3003   12   3    1    0
45   R 309-616    F 45+     0      0    0    0    0
44                F 34-44   0      0    0    0    0
43                F 23-33   0      0    0    0    0
42                F 12-22   29     1    0    0    0
41                F 1-11    433    1    0    0    0
35   R 617-924    F 45+     0      0    0    0    0
34                F 34-44   0      0    0    0    0
33                F 23-33   0      0    0    0    0
32                F 12-22   3      0    0    0    0
31                F 1-11    294    0    0    0    0
25   R 925-1232   F 45+     0      0    0    0    0
24                F 34-44   0      0    0    0    0
23                F 23-33   0      0    0    0    0
22                F 12-22   0      0    0    0    0
21                F 1-11    209    0    0    0    0
15   R 1233+      F 45+     0      0    0    0    0
14                F 34-44   0      0    0    0    0
13                F 23-33   0      0    0    0    0
12                F 12-22   0      0    0    0    0
11                F 1-11    548    0    0    0    0
Basic Model Coincidence Matrix
Correct 0.6076
           Model 0   Model 1   Totals
Actual 0   872       13        885
Actual 1   1949      2166      4115
Totals     2821      2179      5000
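The "Correct" figure is the share of observations on the diagonal of the coincidence matrix. The sketch below reproduces it from the counts above and also shows the accuracy of a degenerate model that predicts 1 for everyone. The function name accuracy is illustrative.

```python
def accuracy(matrix):
    """matrix[actual][predicted] holds counts; accuracy = diagonal / total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

basic_rfm = [[872, 13],       # actual 0: predicted 0, predicted 1
             [1949, 2166]]    # actual 1: predicted 0, predicted 1
all_ones = [[0, 885],         # degenerate model: everyone predicted 1
            [0, 4115]]

print(round(accuracy(basic_rfm), 4))  # 0.6076
print(round(accuracy(all_ones), 4))   # 0.823
```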
BALANCE CELLS
• Adjusted boundaries of 5 x 5 x 5 matrix
• Can’t get all to equal average of 8
– Lumpy (due to ties)
– Ranged from 4 to 11
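One way to picture the balancing step is quantile-style grouping in which tied values always share a group. This is a sketch under that assumption, not the procedure actually used in the study; the function name quantile_groups and the sample values are invented.

```python
def quantile_groups(values, k=5):
    """Assign each value a group 1..k of roughly equal size; tied values share a group."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut point i is the value found i/k of the way through the sorted data
    cuts = [ordered[(i * n) // k] for i in range(1, k)]
    # A value's group is 1 plus the number of cut points it reaches
    return [1 + sum(v >= c for c in cuts) for v in values]

# Distinct values split evenly ...
print(quantile_groups(list(range(10))))
# [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

# ... but heavy ties (many customers with one purchase) make the groups lumpy
print(quantile_groups([1, 1, 1, 1, 1, 1, 2, 3, 4, 5]))
# [3, 3, 3, 3, 3, 3, 4, 4, 5, 5]
```

For recency the numbering would be reversed, since a small R corresponds to the best group.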
Balanced Cell Densities
Correct 0.8380
RF   R           F       M1   M2   M3   M4   M5
55   R 1-22      F 9+    43   41   41   41   42
54               F 6-8   43   44   43   43   44
53               F 4-5   57   64   63   61   63
52               F=3     30   31   35   34   34
51               F 1-2   26   25   25   21   24
45   R 23-48     F 9+    41   41   41   41   41
44               F 6-8   34   38   36   38   37
43               F 4-5   58   56   57   57   57
42               F=3     40   36   37   38   37
41               F 1-2   19   20   18   22   20
35   R 49-151    F 9+    63   64   62   63   63
34               F 6-8   49   50   50   50   50
33               F 4-5   43   43   44   45   44
32               F=3     29   21   28   27   28
31               F 1-2   19   23   19   18   21
25   R 152-672   F 9+    38   38   38   38   38
24               F 6-8   50   49   51   50   50
23               F 4-5   51   51   51   51   51
22               F=3     32   33   31   32   33
21               F 1-2   29   25   33   26   30
15   R 673+      F 9+    16   15   15   15   15
14               F 6-8   15   16   15   16   15
13               F 4-5   30   30   30   30   30
12               F=3     59   70   63   64   64
11               F 1-2   67   73   93   76   69
Alternatives
• LIFT
– Sort groups by best response
– Apply your marketing budget to the most profitable (until you run out of budget)
– LIFT is the gain obtained above par (random)
• VALUE FUNCTION (Yang, 2004)
– Throw out F (correlated with M)
– Use ratio of M/R
• Logistic Regression
• Decision Tree
• Neural Network
LIFT
Equal Groups
V Value by Cell
Cell     Min V    n      Resp=0   Resp=1   %       Avg$
1        0        424    3        421      0.993   94.42
2        0.0464   286    2        284      0.993   101.26
3        0.108    281    1        280      0.996   107.18
4        0.195    352    0        352      1.000   108.25
5        0.376    303    2        301      0.993   119.99
6        0.72     285    9        276      0.968   136.13
7        1.252    292    57       235      0.805   127.05
8        1.952    319    120      199      0.624   98.31
9        2.73     293    101      192      0.655   101.01
10       3.74     229    102      127      0.555   101.69
11       4.972    231    87       144      0.623   102.33
12       6.524    254    86       168      0.661   107.12
13       8.6      218    75       143      0.656   119.83
14       11.08    216    71       145      0.671   119.99
15       14.34    191    49       142      0.743   122.55
16       18.35    207    46       161      0.778   157.82
17       24.15    166    30       136      0.819   159.17
18       32.87    148    17       131      0.885   220.18
19       48.2     175    16       159      0.909   284.74
20       92       130    11       119      0.915   424.69
Totals            5000   885      4115     0.823   131.33
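Lift can be worked out directly from the cell counts above: sort the cells by response rate, accumulate customers and responders, and compare the share of responders reached with the share of customers contacted. The sketch expresses lift as a ratio to random selection, which is one common convention and an assumption here. The function name cumulative_lift is illustrative.

```python
# (n, responders) for V cells 1..20, copied from the table above
cells = [(424, 421), (286, 284), (281, 280), (352, 352), (303, 301),
         (285, 276), (292, 235), (319, 199), (293, 192), (229, 127),
         (231, 144), (254, 168), (218, 143), (216, 145), (191, 142),
         (207, 161), (166, 136), (148, 131), (175, 159), (130, 119)]

def cumulative_lift(cells):
    """Return (share of customers, share of responders, lift) after each added cell."""
    total_n = sum(n for n, _ in cells)
    total_resp = sum(r for _, r in cells)
    steps, cum_n, cum_r = [], 0, 0
    # Best-responding cells first, as a fixed budget would be spent
    for n, r in sorted(cells, key=lambda c: c[1] / c[0], reverse=True):
        cum_n += n
        cum_r += r
        pct_n, pct_r = cum_n / total_n, cum_r / total_resp
        steps.append((pct_n, pct_r, pct_r / pct_n))
    return steps

steps = cumulative_lift(cells)
```

Because the overall response rate is already 0.823, the best possible lift is 5000/4115, about 1.22, and it falls to 1.0 once every cell is included.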
V Model Lift
Models
• Logistic Regression:
-0.4775 + 0.00853 R + 0.1675 F + 0.00213 M
Test data: Correct 0.8230
• Decision Tree
IF R ≤ 82
    AND R ≤ 32
        YES (1567 right, 198 wrong)
    ELSE R > 32
        AND F ≤ 3
            AND M ≤ 296
                NO (285 right, 91 wrong)
            ELSE M > 296
                YES (28 right, 9 wrong)
        ELSE F > 3
            YES (729 right, 110 wrong)
ELSE R > 82
    YES (2391 right, 3 wrong)
Test data: Correct 0.8678
• Neural Network
Test data: Correct 0.8674
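The two explicit models can be written as small scoring functions. This is a sketch under two assumptions: that the regression expression is the linear predictor (logit) of a logistic model, and that the decision tree nests as its ELSE branches suggest. The function names are illustrative.

```python
import math

def logistic_probability(r, f, m):
    """Response probability, assuming the slide's expression is the logit."""
    z = -0.4775 + 0.00853 * r + 0.1675 * f + 0.00213 * m
    return 1.0 / (1.0 + math.exp(-z))

def tree_predict(r, f, m):
    """Decision tree from the slide; True means a predicted response (YES)."""
    if r <= 32 or r > 82:
        return True            # both outer R branches end in YES
    # Here 32 < r <= 82
    if f > 3:
        return True
    return m > 296             # F <= 3: YES only when M > 296

print(tree_predict(50, 2, 100))   # False
print(tree_predict(50, 2, 300))   # True
```

The tree predicts NO only for customers with R between 33 and 82, F of at most 3 and M of at most 296, which is why it is easy to read off and apply.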
COMPARISONS
Model                   Test Accuracy         Benefits               Drawbacks
RFM                     0.6076                Simplest data          Uneven cell densities
Degenerate (all 1)      0.8230
Balanced cell sizes     0.7156                Better statistically   More data manipulation
Balanced cell sizes $   0.8380                Better statistically
Value function          0.8180                Condense to one IV     Less information
Logistic regression     0.8230 (degenerate)   Additional IVs         Formula hard to apply
Decision tree           0.8678                Easy to interpret
Neural network          0.8674                Fit nonlinear data     Hard to apply model