Submission Document

Kaggle - Dunnhumby's Shopper Challenge
2nd Place Finish Methodology
Neil Schneider
Neil.Schneider@Milliman.com
Introduction
The Dunnhumby's Shopper Challenge asked for models to predict the shopping habits of customers
followed through a grocery chain's customer loyalty cards. The goal was to predict each customer's
date of return and the spend amount during that visit. The scoring metric was the percentage of
customers for whom the correct day of return was selected and the predicted spend was within $10 of
the actual spend amount. This paper outlines the strategy used to correctly predict 18.67% of the
shoppers' habits. A slight alternative to this method, which was not selected for inclusion in the
competition, will also be discussed; it improved the score to 18.83%. All methodologies were
developed using SAS, R, and JMP.
Data Preparations
The raw shopper data were processed to add additional variables before beginning any analyses. Both
the training and testing sets went through a four-step process; all code can be found in Appendix A. The
first step was to limit the time period to dates between April 1, 2010 and March 31, 2011. Step two
involved creating additional variables for the day of the week and the day of the month of each
purchase. The third step retained the previous visit's information, which the final step used to
create a days-between-visits variable. Control processes were applied to the training set, and the
retained visit information was nullified for the first entry of each shopper. Unfortunately, in the testing
set the first visit for each customer included the last visit's information from the previous customer. The
impact of this error should be negligible, if there is any impact at all. The processed data were also used
to create a Last Visit set. This set included the last entry for each shopper and a variable indicating
the number of days already passed since his or her last visit.
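As a sketch of these four steps (the production code is the SAS in Appendix A; the data frame visits and its column names below are assumptions mirroring that code):

visits <- visits[order(visits$customer_id, visits$visit_date), ]
# Step 1: limit to the explanatory period 4/1/2010 - 3/31/2011
visits <- visits[visits$visit_date >= as.Date("2010-04-01") &
                 visits$visit_date <  as.Date("2011-04-01"), ]
# Step 2: day-of-week (1 = Sunday, matching SAS weekday()) and day-of-month variables
visits$day      <- as.POSIXlt(visits$visit_date)$wday + 1
visits$dayofmth <- as.POSIXlt(visits$visit_date)$mday
# Step 3: retain the previous visit's information within each customer
lag1 <- function(x) c(NA, head(x, -1))
visits$prior_date  <- as.Date(ave(as.numeric(visits$visit_date), visits$customer_id,
                                  FUN = lag1), origin = "1970-01-01")
visits$prior_spend <- ave(visits$visit_spend, visits$customer_id, FUN = lag1)
# Step 4: days between visits (NA for each customer's first visit, as in the training control)
visits$days_since_prior <- as.numeric(visits$visit_date - visits$prior_date)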
One more set was created from the raw data: the results for the customers in the training set. There
was one entry per shopper, holding the first return date after March 31, 2011. It included only
customer_id, visit_date, and visit_spend.
Visit_Spend Methodology
All discussed methods used the same spend amounts. The amounts differ based on the projected
day of the week for the shopper's return, but the methodology is the same. A member's next spend
amount was developed on historical data only; no model was trained on data past March 31,
2011. The training data were used later only to optimize method selection.
The chosen method optimizes the results based on the testing statistic for this competition. The metric
for determining whether the projected visit spend amount is correct was being within $10 of the actual
spend amount. Maximizing the number of spends within the resulting $20 window was accomplished by
empirically calculating the $20 range in which a customer most often spends. I termed this window the
Modal Range. Typically, it is less than both the mean and the median of a customer's spending habits.
Predictions were further enhanced by determining a modal range for each day of the week. In the final
submissions, these values were also front weighted by triangle weighting the dates from April 1, 2010.
(A spend on April 1, 2010 has a weight of one and a spend on March 31, 2011 has a weight of 365.)
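A minimal sketch of the calculation for one customer, assuming vectors spend and date for that customer's visits (names hypothetical; the production SAS in the appendix crosses every spend against every other spend in SQL):

modal_range <- function(spend, date, start = as.Date("2010-04-01")) {
  w <- as.numeric(date - start)            # triangle front weight: days elapsed since April 1, 2010
  o <- order(spend)                        # sort so a tie picks the smaller (first) window
  spend <- spend[o]; w <- w[o]
  # weighted count of this customer's spends falling in the $20 window [s, s + 20]
  score <- vapply(spend, function(s) sum(w[spend >= s & spend <= s + 20]), numeric(1))
  spend[which.max(score)] + 10             # predict the midpoint of the best window
}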
The projected visit spend was based on the day of the week of the projected visit date. In cases where
the customer does not have enough experience on the return day of the week, his or her overall modal
range is assumed. The training data were used to develop credibility thresholds for selecting a
customer's daily modal range versus their overall modal range. The thresholds were hard cutoffs: if the
customer did not have enough visits on a particular day of the week, the overall modal range was
projected. The overall modal range was not front weighted like the daily ranges.
Optimized thresholds based on the training data are shown in the following table.

Day of the Week    Experience Threshold
Sunday             15
Monday             17
Tuesday            22
Wednesday          19
Thursday           16
Friday             26
Saturday           21

*These thresholds should have been optimized based on the actual return visit date,
but may have been optimized for a previous date projection method.
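Applied at prediction time, the selection reduces to the following sketch (daily_range, n_visits, and overall_range are hypothetical inputs: the customer's seven daily modal ranges with NA where none exists, the visit counts per day of the week, and the overall modal range):

threshold <- c(15, 17, 22, 19, 16, 26, 21)   # Sunday through Saturday, from the table above
pick_spend <- function(return_day, daily_range, n_visits, overall_range) {
  if (!is.na(daily_range[return_day]) && n_visits[return_day] >= threshold[return_day])
    daily_range[return_day]                  # enough experience: use the day-of-week modal range
  else
    overall_range                            # otherwise fall back to the overall modal range
}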
Visit_Date Methodology
The key modeling technique used in predicting a customer's return date was a Generalized Boosting
Regression Model (GBM). In R, the "gbm" package provided the implementation. Individual models
were developed for each possible day of return from April 1, 2011 through April 9, 2011. April 9th was
chosen as the last day because of time constraints and the empirical distribution of the training data.
Each day was modeled as a Bernoulli outcome indicating whether the customer first returned on that
date. The first step was to create a suite of explanatory variables to use in the GBM regression. Many of
the variables used were developed in previous methods. Ultimately, 106 variables were entered into
the GBM model. Some sets of variables may have been perfectly correlated with other columns, but
that does not matter in GBM modeling. The table below lists the variables:
Variable               Number of this Type                       Description
Wkday_prior_freq       7 - one per day of the week               Counts of visits based on the previous visit's day of the week.
Avg_days_prior         7 - one per day of the week               The average number of days since the previous visit.
Med_days_prior         7 - one per day of the week               The median number of days since the previous visit.
Mod_days_prior         7 - one per day of the week               The mode of the number of days since the previous visit.
stdev_days_prior       7 - one per day of the week               The standard deviation of days since the previous visit.
Wkday_weight           7 - one per day of the week               Sum of weighted visits for each day of the week.
Wkday_freq             7 - one per day of the week               Count of visits for each day of the week.
Wkday_prob             7 - one per day of the week               The probability of a customer visiting on a day of the week, based on the weights.
Wkday_prob_strength    7 - one per day of the week               The wkday_prob variable multiplied by the wkday_freq variable.
Wkday_total_weight     1                                         Sum of wkday_weight for each customer.
Wkday_total_freq       1                                         Sum of wkday_freq for each customer.
Days_since_freq        36 - one per common days between visits   Count of visits after each common number of days between visits.
Days_since_total_freq  1                                         Sum of the Days_since_freq variable for each customer.
Last_spend             1                                         The spend amount of the last visit before April 1, 2011.
Last_dayofmth          1                                         The day of the month of the last visit before April 1, 2011.
Last_day               1                                         The day of the week of the last visit before April 1, 2011.
Days_already_past      1                                         The number of days between the last visit and April 1, 2011.
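As an illustration of the weighting behind the wkday_ variables, a sketch (assuming the prepared visits frame from the data-preparation sketch above):

# cube root of the triangle front weight: days elapsed since April 1, 2010
w <- as.numeric(visits$visit_date - as.Date("2010-04-01"))^(1/3)
wkday_weight       <- tapply(w, list(visits$customer_id, visits$day), sum)    # customer x day-of-week
wkday_total_weight <- rowSums(wkday_weight, na.rm = TRUE)
wkday_prob         <- wkday_weight / wkday_total_weight   # weighted probability of a visit on each day
wkday_freq         <- tapply(w, list(visits$customer_id, visits$day), length) # plain visit counts
wkday_prob_strength <- wkday_prob * wkday_freq             # probability weighted by the frequency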
The next step was to create the models. Each date had a model designed for it. Due to processing time,
parameters such as the number of cross-validation folds, the number of trees, and the shrinkage were
changed between dates. These parameters can be extracted from the GBM class data provided with
this documentation. Cross-validation folds were either 5 or 10, the number of trees ranged from
30,000 to 50,000, the shrinkage ranged from 0.0001 to 0.002, and the interaction depth ranged from
2 to 3. The best iteration number was selected by the cross-validation method.
Finally, the test data were run through the GBM models, and the set of nine scores from the models
was compared. In the final selected model, the maximum score for each customer was chosen; this
maximum represented the expected day of return. This method produced the second-place finish
submission. In another submission, I ran a nominal logistic regression of a customer's date of return
on the scores from the GBM models. This model was then used to project the test customers' dates of
return. This additional step added 0.16 to the final score, but was not submitted for consideration.
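A minimal sketch of that alternative step, assuming nnet::multinom for the nominal logistic regression (the specific routine is not named above) and hypothetical matrices scores and test_scores holding the nine GBM scores, with return_date the actual training return dates as a factor:

library(nnet)                               # multinomial (nominal) logistic regression
colnames(scores) <- colnames(test_scores) <- paste0("gbm_apr0", 1:9)
fit <- multinom(return_date ~ ., data = data.frame(scores, return_date))
# classify each test customer to the most probable return date
forecast_date <- predict(fit, newdata = data.frame(test_scores))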
Control of Work Product
Attempts to recreate the winning submission were slowed by poor version control as the competition
came to a close. The training sets used in the GBM model had two errors for each customer. The
Wkday_prior_freq variable had one visit shifted from one day to another, which also created an issue
with the avg, med, and mod _days_prior variables. The average is the statistic most influenced by
outliers and had the biggest impact. While these errors were corrected in SAS, the corrections were not
carried through to R. In the test data used to create the final predictions, there was only an error in the
count of wkday_prior_freq: the inclusion of one additional count that should not have been included.
Code to create the erroneous data is included in the SAS programs; all erroneous sets end in the suffix
"_bad". These errors should not have hurt the predictive power of the model, and correcting them
should improve on or maintain the current results.
Code
Code to recreate this submission is included in its entirety in the appendix and in a separately attached
zip file: Winning Submission.zipx.
APPENDIX
*******************************************************;
*                                                     *;
*   Kaggle - Dunnhumby's Shoppers Challenge           *;
*                                                     *;
*   Written by Neil Schneider                         *;
*                                                     *;
*   Program processes raw shopper data for use in     *;
*   subsequent programs.                              *;
*                                                     *;
*******************************************************;
libname Drive 'C:\Dunnhumby\Winning Submission\';
libname raw 'C:\Dunnhumby\SAS Data';
*Bring in and modify the raw data;
*Step 1: Limit data to explanatory period 4/1/2010-3/31/2011;
*Step 2: Create "Day of Week" and "Day of Month" variables;
*Step 3: Retain Date, Spend, Day of Week, and Day of Month variable from the
previous record.;
*NOTE: Additional code to reset the first entry for each member was not
applied to the testing data set;
*Step 4: Create the "Days Since Prior" variable. This is the time between
visits;
data drive.train_set_mod;
set raw.Train_set;
by customer_id visit_date;
where visit_date ge '01APR2010'd and visit_date lt '01APR2011'd;
dayofmth = day(visit_date);
day = weekday(visit_date);
format prior_date yymmdd10.;
prior_date = lag1(visit_date);
prior_spend = lag1(visit_spend);
prior_day = weekday(prior_date);
prior_dayofmth = day(prior_date);
if first.customer_id then do;
days_since_prior = .;
prior_date = .;
prior_spend = .;
prior_day = .;
prior_dayofmth = .;
end;
else days_since_prior = visit_date - prior_date;
run;
***THIS IS THE CONSEQUENCE OF BAD VERSION CONTROL***;
*This data is used in the GBM regressions.;
*I guess this proves the system is robust to still produce
acceptable answers;
*Drop the first record for each customer_id since there is no
prior information to fill in on this line;
data train_set_mod_bad;
set raw.Train_set;
by customer_id;
if ^first.customer_id then output;
run;
*Use the prior information to fill in each record.
All customers after the first will have an erroneous first entry derived
from the previous customer's last record;
data drive.train_set_mod_bad;
set train_set_mod_bad;
by customer_id visit_date;
where visit_date ge '01APR2010'd and visit_date lt
'01APR2011'd;
dayofmth = day(visit_date);
day = weekday(visit_date);
format prior_date yymmdd10.;
prior_date = lag1(visit_date);
prior_spend = lag1(visit_spend);
prior_day = weekday(prior_date);
prior_dayofmth = day(prior_date);
days_since_prior = visit_date - prior_date;
run;
*Do this correctly for the test group;
data drive.test_set_mod;
set raw.test_set;
by customer_id visit_date;
where visit_date ge '01APR2010'd and visit_date lt '01APR2011'd;
dayofmth = day(visit_date);
day = weekday(visit_date);
format prior_date yymmdd10.;
prior_date = lag1(visit_date);
prior_spend = lag1(visit_spend);
prior_day = weekday(prior_date);
prior_dayofmth = day(prior_date);
days_since_prior = visit_date - prior_date;
if first.customer_id then do;
days_since_prior = .;
prior_date = .;
prior_spend = .;
prior_day = .;
prior_dayofmth = .;
end;
else days_since_prior = visit_date - prior_date;
run;
***THIS IS THE CONSEQUENCE OF BAD VERSION CONTROL***;
*This data is used in the GBM regressions.;
*I guess this proves the system is robust to still produce
acceptable answers;
*Use the prior information to fill in each record.
All customers after the first will have an erroneous first entry derived
from the previous customer's last record;
data drive.test_set_mod_bad;
set raw.test_set;
by customer_id visit_date;
where visit_date ge '01APR2010'd and visit_date lt
'01APR2011'd;
dayofmth = day(visit_date);
day = weekday(visit_date);
format prior_date yymmdd10.;
prior_date = lag1(visit_date);
prior_spend = lag1(visit_spend);
prior_day = weekday(prior_date);
prior_dayofmth = day(prior_date);
If visit_date gt prior_date then days_since_prior =
visit_date - prior_date;
else days_since_prior = .;
run;
*Using the sets created above.;
*Retain the last entry for each shopper;
*Create a variable counting the days passed since the last visit.;
data drive.train_last_spend;
set drive.train_set_mod;
by customer_id;
days_already_past = '01APR2011'd - visit_date;
rename visit_date = last_date visit_spend=last_spend
dayofmth = last_dayofmth day=last_day
prior_date = sec_last_date prior_spend=sec_last_spend
prior_dayofmth=sec_last_dayofmth prior_day=sec_last_day
days_since_prior=last_days_since;
if last.customer_id;
run;
*Using the sets created above.;
*Retain the last entry for each shopper;
*Create a variable counting the days passed since the last visit.;
data drive.test_last_spend;
set drive.test_set_mod;
by customer_id;
days_already_past = '01APR2011'd - visit_date;
rename visit_date = last_date visit_spend=last_spend
dayofmth = last_dayofmth day=last_day
prior_date = sec_last_date prior_spend=sec_last_spend
prior_dayofmth=sec_last_dayofmth prior_day=sec_last_day
days_since_prior=last_days_since;
if last.customer_id;
run;
*Create training set results;
data drive.train_set_results_01APR2011;
set raw.train_set;
where visit_date ge '01APR2011'd;
by customer_id visit_date;
if first.customer_id;
run;
*Create Results dummy flags for R GBM regressions;
data drive.train_date_results_dummyflag;
set drive.train_set_results_01APR2011;
drop visit_spend;
	*create date flags;
	apr_01 = -1;
	apr_02 = -1;
	apr_03 = -1;
	apr_04 = -1;
	apr_05 = -1;
	apr_06 = -1;
	apr_07 = -1;
	apr_08 = -1;
	apr_09 = -1;
	apr_10 = -1;
	apr_11 = -1;
	apr_12 = -1;
	apr_13 = -1;
	apr_14 = -1;
	apr_15 = -1;
	apr_16 = -1;
	apr_xx = -1;
	select (visit_date);
		when ('01APR2011'd) apr_01 = 1;
		when ('02APR2011'd) apr_02 = 1;
		when ('03APR2011'd) apr_03 = 1;
		when ('04APR2011'd) apr_04 = 1;
		when ('05APR2011'd) apr_05 = 1;
		when ('06APR2011'd) apr_06 = 1;
		when ('07APR2011'd) apr_07 = 1;
		when ('08APR2011'd) apr_08 = 1;
		when ('09APR2011'd) apr_09 = 1;
		when ('10APR2011'd) apr_10 = 1;
		when ('11APR2011'd) apr_11 = 1;
		when ('12APR2011'd) apr_12 = 1;
		when ('13APR2011'd) apr_13 = 1;
		when ('14APR2011'd) apr_14 = 1;
		when ('15APR2011'd) apr_15 = 1;
		when ('16APR2011'd) apr_16 = 1;
		otherwise apr_xx = 1;
	end;
run;
*Output CSV file for R;
proc export data = drive.train_date_results_dummyflag
outfile = 'C:\Dunnhumby\Winning Submission\train_date_results.csv'
DBMS = csv replace;
run;
*************************************************************;
*                                                           *;
*   Kaggle - Dunnhumby's Shoppers Challenge                 *;
*                                                           *;
*   Written by Neil Schneider                               *;
*                                                           *;
*   Program develops projections for shoppers' date of      *;
*   return to the store.                                    *;
*                                                           *;
*************************************************************;
libname Drive 'C:\Dunnhumby\Winning Submission\';
********************************************************;
****************Training Data***************************;
********************************************************;
data train_set_date_mod;
merge drive.train_set_mod
drive.train_last_spend (keep= customer_id
days_already_past);
by customer_id;
run;
***THIS IS THE CONSEQUENCE OF BAD VERSION CONTROL***;
data train_set_date_mod_bad;
merge drive.train_set_mod_bad
drive.train_last_spend (keep= customer_id
days_already_past);
by customer_id;
run;
*Date prediction is based on Generalized Boosting Regression Model.;
*The following code creates variables to be considered by the GBM
function.;
**********************************************************************;
*************Days since last purchase frequencies*********************;
**********************************************************************;
*Determine frequency of the days between visits for a
customer;
*Only allows for days between visits that are greater than
the time already elapsed;
proc summary missing data=train_set_date_mod;
where days_since_prior ge days_already_past;
class customer_id days_since_prior days_already_past;
output out=train_freq_days_since ;
run;
*Type 4 in the summary is the customer level frequency for
all eligible visits;
*Type 6 in the summary is the customer and days between
visits frequency;
*Only day frequencies over 5 are included;
data train_prob_days_since (drop= days_already_past _TYPE_
	rename=(mem_freq=days_since_total_freq
	_FREQ_=days_since_freq));
merge train_freq_days_since (where= (_TYPE_=4)
rename=(_FREQ_=mem_freq))
train_freq_days_since (where= (_TYPE_=6))
;
by customer_id;
days_since_prob = _FREQ_ / mem_freq;
if _FREQ_ gt 5;
run;
*x1 set is the number of visits at each days-between-visits value
for each customer;
*null values are included when customers have a frequency
of 0-5;
proc transpose data=train_prob_days_since
out=train_prob_days_since_x1 (drop=_NAME_)
label=days_since_prior
prefix=days_since_freq;
by customer_id;
id days_since_prior;
var days_since_freq;
run;
*x2 set is the calculated probability of the customer returning
after that many days;
*Again null values are created as above;
proc transpose data=train_prob_days_since
out=train_prob_days_since_x2 (drop=_NAME_)
label=days_since_prior
prefix=days_since_prob;
by customer_id;
id days_since_prior;
var days_since_prob;
run;
*Keep one entry per customer to include the total frequency
considered;
data train_prob_days_since_x3;
set train_prob_days_since;
by customer_id;
if first.customer_id;
keep customer_id days_since_total_freq;
run;
*Combine the three prior sets to create a days-between-visits
set to be included in the GBM regression;
data train_prob_days_since_x;
merge train_prob_days_since_x1
train_prob_days_since_x2
train_prob_days_since_x3;
by customer_id;
run;
*******************************************************************;
*************Day of week purchase frequencies**********************;
*******************************************************************;
*To calculate the probabilities of visits by a customer for
a given day of the week,
the value of the weights by day is needed;
*The weights are the cube root of triangle front
weighting.;
proc sql;
create table train_freq_wkday as
select
customer_id
,day
,sum((visit_date - '01APR2010'd)**(1/3)) as
wkday_weight
,count(*) as wkday_freq
from train_set_date_mod
group by customer_id
,day
;
quit;
run;
*to calculate the probabilities of visits by a customer for
a given day of the week,
the total weight is needed;
*The weights are the cube root of triangle front
weighting.;
proc sql;
create table train_freq_wkday_total as
select
customer_id
,sum((visit_date - '01APR2010'd)**(1/3)) as
wkday_total_weight
,count(*) as wkday_total_freq
from train_set_date_mod
group by customer_id
;
quit;
run;
*The weighted frequencies and count of visits are combined;
*This step creates the probability of a visit for a day of
the week;
*Also a "strength" metric was created, this is the
probability weighted by the frequency;
data train_prob_day;
merge train_freq_wkday
train_freq_wkday_total
;
by customer_id;
wkday_prob = wkday_weight / wkday_total_weight;
wkday_prob_strength = wkday_prob * wkday_freq;
run;
*Transpose the weights for each day into a single row
for each customer;
proc transpose data=train_prob_day
out=train_prob_day_x1 (drop=_NAME_)
label=day
prefix=wkday_weight;
by customer_id;
id day;
var wkday_weight;
run;
*Transpose the frequency counts for each day into a single
row for each customer;
proc transpose data=train_prob_day
out=train_prob_day_x2 (drop=_NAME_)
label=day
prefix=wkday_freq;
by customer_id;
id day;
var wkday_freq;
run;
*Transpose the day of week probabilities for each day into
a single row for each customer;
proc transpose data=train_prob_day
out=train_prob_day_x3 (drop=_NAME_)
label=day
prefix=wkday_prob;
by customer_id;
id day;
var wkday_prob;
run;
*Transpose the strength of the day of week probabilities
for each day into a single row for each customer;
proc transpose data=train_prob_day
out=train_prob_day_x4 (drop=_NAME_)
label=day
prefix=wkday_prob_strength;
by customer_id;
id day;
var wkday_prob_strength;
run;
*Create a set for the total frequency and total weight for
a customer;
data train_prob_day_x5;
set train_prob_day;
by customer_id;
if first.customer_id;
keep customer_id wkday_total_freq wkday_total_weight;
run;
*Combine the prior group of datasets into a single set for
inclusion in the master set for the GBM regression;
data train_prob_day_x;
merge train_prob_day_x1
train_prob_day_x2
train_prob_day_x3
train_prob_day_x4
train_prob_day_x5
;
by customer_id;
run;
**********************************************************************;
********************Days Since Last Statistics************************;
**********************************************************************;
***THIS IS THE CONSEQUENCE OF BAD VERSION CONTROL***;
*Additional statistics about the days between visits to be
included in the GBM regression;
*Based on the day of the week of the previous visit;
proc summary nway missing data=train_set_date_mod_bad;
where prior_day ne .;
class customer_id prior_day;
var days_since_prior;
output out=train_prob_days_since_stats
(drop= _TYPE_ rename=_FREQ_=wkday_prior_freq)
mean=avg_days_prior
median=med_days_prior
mode=mod_days_prior
std=stdev_wkday_prior;
run;
*Group of transposes to change all day of week
statistics into a single row for each customer;
proc transpose data=train_prob_days_since_stats
out=train_prob_days_since_stats_x1 (drop=_NAME_)
label=prior_day
prefix=wkday_prior_freq;
by customer_id;
id prior_day;
var wkday_prior_freq;
run;
proc transpose data=train_prob_days_since_stats
out=train_prob_days_since_stats_x2 (drop=_NAME_)
label=prior_day
prefix=avg_days_prior;
by customer_id;
id prior_day;
var avg_days_prior;
run;
proc transpose data=train_prob_days_since_stats
out=train_prob_days_since_stats_x3 (drop=_NAME_)
label=prior_day
prefix=med_days_prior;
by customer_id;
id prior_day;
var med_days_prior;
run;
proc transpose data=train_prob_days_since_stats
out=train_prob_days_since_stats_x4 (drop=_NAME_)
label=prior_day
prefix=mod_days_prior;
by customer_id;
id prior_day;
var mod_days_prior;
run;
proc transpose data=train_prob_days_since_stats
out=train_prob_days_since_stats_x5 (drop=_NAME_)
label=prior_day
prefix=stdev_wkday_prior;
by customer_id;
id prior_day;
var stdev_wkday_prior;
run;
*Group combination to include in the master set.;
data train_prob_days_since_stats_x;
merge train_prob_days_since_stats_x1
train_prob_days_since_stats_x2
train_prob_days_since_stats_x3
train_prob_days_since_stats_x4
train_prob_days_since_stats_x5
;
by customer_id;
run;
**********************************************************************;
*****************Create Train Master Set******************************;
**********************************************************************;
*This set includes all the information and statistics to be
included in the GBM regression;
*Also, for the training set, the first visit after March 31,
2011 is included;
*The "if results" clause is used to drop the 48 customers who
do not return after March 31, 2011;
data drive.train_master_set;
format customer_id last_date;
merge train_prob_days_since_stats_x
train_prob_day_x
train_prob_days_since_x
drive.train_last_spend
(drop= last_days_since sec:)
drive.train_set_results_01apr2011
(in=results)
;
by customer_id;
if results;
drop _NAME_;
run;
*Output CSV file for R;
proc export data = drive.train_master_set
outfile =
'C:\Dunnhumby\Winning Submission\train_master_set_R.csv'
DBMS = csv replace;
run;
********************************************************;
******************Test Data*****************************;
********************************************************;
data test_set_date_mod;
merge drive.test_set_mod
drive.test_last_spend (keep= customer_id
days_already_past);
by customer_id;
run;
***THIS IS THE CONSEQUENCE OF BAD VERSION CONTROL***;
data test_set_date_mod_bad;
merge drive.test_set_mod_bad
drive.test_last_spend (keep= customer_id
days_already_past);
by customer_id;
run;
*Date prediction is based on Generalized Boosting Regression Model.;
*The following code creates variables to be considered by the GBM
function.;
*All summaries below match the summaries for the training data above;
**********************************************************************;
***************Days since last purchase frequencies*******************;
**********************************************************************;
proc summary missing data=test_set_date_mod;
where days_since_prior ge days_already_past;
class customer_id days_since_prior days_already_past;
output out=test_freq_days_since ;
run;
data test_prob_days_since (drop= days_already_past _TYPE_
	rename=(mem_freq=days_since_total_freq
	_FREQ_=days_since_freq));
merge test_freq_days_since (where= (_TYPE_=4)
rename=(_FREQ_=mem_freq))
test_freq_days_since (where= (_TYPE_=6))
;
by customer_id;
days_since_prob = _FREQ_ / mem_freq;
if _FREQ_ gt 5;
run;
proc transpose data=test_prob_days_since
out=test_prob_days_since_x1 (drop=_NAME_)
label=days_since_prior
prefix=days_since_freq;
by customer_id;
id days_since_prior;
var days_since_freq;
run;
proc transpose data=test_prob_days_since
out=test_prob_days_since_x2 (drop=_NAME_)
label=days_since_prior
prefix=days_since_prob;
by customer_id;
id days_since_prior;
var days_since_prob;
run;
data test_prob_days_since_x3;
set test_prob_days_since;
by customer_id;
if first.customer_id;
keep customer_id days_since_total_freq;
run;
data test_prob_days_since_x;
merge test_prob_days_since_x1
test_prob_days_since_x2
test_prob_days_since_x3;
by customer_id;
run;
**********************************************************************;
***************Day of week purchase frequencies***********************;
**********************************************************************;
proc sql;
create table test_freq_wkday as
select
customer_id
,day
,sum((visit_date - '01APR2010'd)**(1/3)) as
wkday_weight
,count(*) as wkday_freq
from test_set_date_mod
group by customer_id
,day
;
quit;
run;
proc sql;
create table test_freq_wkday_total as
select
customer_id
,sum((visit_date - '01APR2010'd)**(1/3)) as
wkday_total_weight
,count(*) as wkday_total_freq
from test_set_date_mod
group by customer_id
;
quit;
run;
data test_prob_day;
merge test_freq_wkday
test_freq_wkday_total
;
by customer_id;
wkday_prob = wkday_weight / wkday_total_weight;
wkday_prob_strength = wkday_prob * wkday_freq;
run;
proc transpose data=test_prob_day
out=test_prob_day_x1 (drop=_NAME_)
label=day
prefix=wkday_weight;
by customer_id;
id day;
var wkday_weight;
run;
proc transpose data=test_prob_day
out=test_prob_day_x2 (drop=_NAME_)
label=day
prefix=wkday_freq;
by customer_id;
id day;
var wkday_freq;
run;
proc transpose data=test_prob_day
out=test_prob_day_x3 (drop=_NAME_)
label=day
prefix=wkday_prob;
by customer_id;
id day;
var wkday_prob;
run;
proc transpose data=test_prob_day
out=test_prob_day_x4 (drop=_NAME_)
label=day
prefix=wkday_prob_strength;
by customer_id;
id day;
var wkday_prob_strength;
run;
data test_prob_day_x5;
set test_prob_day;
by customer_id;
if first.customer_id;
keep customer_id wkday_total_freq wkday_total_weight;
run;
data test_prob_day_x;
merge test_prob_day_x1
test_prob_day_x2
test_prob_day_x3
test_prob_day_x4
test_prob_day_x5
;
by customer_id;
run;
*********************************************************************;
********************Days Since Last Statistics***********************;
*********************************************************************;
proc summary nway missing data=test_set_date_mod_bad;
where prior_day ne .;
class customer_id prior_day;
var days_since_prior;
output out=test_prob_days_since_stats
(drop= _TYPE_ rename=_FREQ_=wkday_prior_freq)
mean=avg_days_prior
median=med_days_prior
mode=mod_days_prior
std=stdev_wkday_prior;
run;
proc transpose data=test_prob_days_since_stats
out=test_prob_days_since_stats_x1 (drop=_NAME_)
label=prior_day
prefix=wkday_prior_freq;
by customer_id;
id prior_day;
var wkday_prior_freq;
run;
proc transpose data=test_prob_days_since_stats
out=test_prob_days_since_stats_x2 (drop=_NAME_)
label=prior_day
prefix=avg_days_prior;
by customer_id;
id prior_day;
var avg_days_prior;
run;
proc transpose data=test_prob_days_since_stats
out=test_prob_days_since_stats_x3 (drop=_NAME_)
label=prior_day
prefix=med_days_prior;
by customer_id;
id prior_day;
var med_days_prior;
run;
proc transpose data=test_prob_days_since_stats
out=test_prob_days_since_stats_x4 (drop=_NAME_)
label=prior_day
prefix=mod_days_prior;
by customer_id;
id prior_day;
var mod_days_prior;
run;
proc transpose data=test_prob_days_since_stats
out=test_prob_days_since_stats_x5 (drop=_NAME_)
label=prior_day
prefix=stdev_wkday_prior;
by customer_id;
id prior_day;
var stdev_wkday_prior;
run;
data test_prob_days_since_stats_x;
merge test_prob_days_since_stats_x1
test_prob_days_since_stats_x2
test_prob_days_since_stats_x3
test_prob_days_since_stats_x4
test_prob_days_since_stats_x5
;
by customer_id;
run;
**********************************************************************;
**********************Create Master Set*******************************;
**********************************************************************;
*This set includes all the information and statistics to
apply the GBM regression results
to make projections on a customer's next visit date;
data drive.test_master_set;
format customer_id last_date;
merge test_prob_days_since_stats_x
test_prob_day_x
test_prob_days_since_x
drive.test_last_spend
(drop= last_days_since sec:)
;
by customer_id;
drop _NAME_;
run;
*This step stacks the training set with the test set but keeps
only the test records.
This copies the full set of training-set variables (columns)
onto the test set;
data drive.Test_Master_set_v2;
set drive.train_master_set
drive.test_master_set (in=test)
;
if test;
drop visit_date visit_spend;
run;
*Output CSV file for R;
proc export data = drive.Test_Master_set_v2
outfile =
'C:\Dunnhumby\Winning Submission\test_master_set_R.csv'
DBMS = csv replace;
run;
*******************************************************;
*                                                     *;
*   Kaggle - Dunnhumby's Shoppers Challenge           *;
*                                                     *;
*   Written by Neil Schneider                         *;
*                                                     *;
*   Program develops projections for shoppers' next   *;
*   spend amount.                                     *;
*                                                     *;
*******************************************************;
libname Drive 'C:\Dunnhumby\Winning Submission\';
********************************************************;
****************Training Data***************************;
********************************************************;
data train_set_mod;
set drive.train_set_mod;
run;
********************************************************;
****************Modal range spend***********************;
********************************************************;
*Modal Range Spend is defined as the empirical range with the maximum number
of spends;
*In case of a tie the first range is chosen. (This errs towards smaller
spends);
*Cross every customer spend amount with every other spend amount
for that customer;
*Use triangle weighting to give more recent spend dates more
weight;
proc sql;
create table prob_cnt as
select
main.customer_id
,main.visit_date
,main.visit_spend
,sum(case
when sec.visit_spend ge main.visit_spend and
sec.visit_spend - 20 le main.visit_spend
then (sec.visit_date - '01APR2010'd)
else 0
end) as prob_cnt
from train_set_mod as main
inner join train_set_mod as sec on
main.customer_id = sec.customer_id
group by main.customer_id
,main.visit_date
,main.visit_spend
;
quit;
run;
*Find the maximum count of spends by customer;
proc summary nway missing data=prob_cnt;
class customer_id;
var prob_cnt;
output out=max_prob (drop=_TYPE_) max=max_prob;
run;
*Output the modal range for each customer;
data drive.train_modal_range
(drop = visit_date visit_spend prob_cnt output_flag
rename=_FREQ_=total_visits);
merge prob_cnt
max_prob;
by customer_id;
retain output_flag;
*Reset Output Flag;
if first.customer_id then output_flag = 0;
*Output the first spend amount whose count equals the maximum;
if prob_cnt = max_prob and output_flag = 0 then do;
output_flag = 1;
*The bottom value of the range was used in developing
the maximum range counts, so the middle of the range
is +10;
modal_range = visit_spend + 10;
output;
end;
run;
***********************************************************;
***********Modal range spend by Day of Week****************;
***********************************************************;
*Same idea as above, only this time each customer has a modal range of spend
amounts for Sunday through Saturday.;
*Cross every customer spend amount with every other spend amount
for that customer for a particular day of the week.;
*Use triangle weighting to give more recent spend dates more
weight;
proc sql;
create table prob_daily_cnt as
select
main.customer_id
,main.day
,main.visit_date
,main.visit_spend
,sum(case
when sec.visit_spend ge main.visit_spend and
sec.visit_spend - 20 le main.visit_spend
then (sec.visit_date - '01APR2010'd)
else 0
end) as prob_cnt
from train_set_mod as main
inner join train_set_mod as sec on
main.customer_id = sec.customer_id and
main.day = sec.day
group by main.customer_id
,main.day
,main.visit_date
,main.visit_spend
;
quit;
run;
*Find the maximum count of spends by customer and day of the
week;
proc summary nway missing data=prob_daily_cnt;
class customer_id day;
var prob_cnt;
output out=max_prob (drop=_TYPE_) max=max_prob;
run;
*Output the modal range for each customer;
data modal_range_wkday
(drop = visit_date visit_spend prob_cnt output_flag);
merge prob_daily_cnt
max_prob;
by customer_id day;
retain output_flag;
*Reset Output Flag;
if first.day then output_flag = 0;
*Output the first spend amount whose count equals the maximum;
if prob_cnt = max_prob and output_flag = 0 then do;
output_flag = 1;
*The bottom value of the range was used in developing
the maximum range counts, so the middle of the range
is +10;
modal_range = visit_spend + 10;
output;
end;
run;
*Transpose the day of week variable into columns;
proc transpose data=modal_range_wkday
out=drive.train_modal_range_wkday
label=day
prefix=Modal_range_;
by customer_id;
id day;
var modal_range;
run;
********************************************************;
**********Number of Visits during entire year***********;
********************************************************;
proc summary nway missing data=train_set_mod;
class customer_id day;
output out=drive.train_entire_visits (drop=_TYPE_
rename=_FREQ_=visit_num_entire);
run;
proc transpose data=drive.train_entire_visits
out=drive.train_entire_visit_wkday
label=day
prefix= visit_num_entire_;
by customer_id;
id day;
var visit_num_entire;
run;
*Find the answers;
data drive.train_forecast_spend (drop=_NAME_);
merge drive.train_last_spend
drive.train_modal_range
drive.train_modal_range_wkday
drive.train_entire_visit_wkday
drive.train_set_results_01apr2011 (in=results);
by customer_id;
if results;
run;
********************************************************;
******************Test Data*****************************;
********************************************************;
data test_set_mod;
set drive.test_set_mod;
run;
********************************************************;
****************Modal range spend***********************;
********************************************************;
*Modal Range Spend is defined as the empirical range with the maximum number
of spends;
*In case of a tie the first range is chosen. (This errs towards smaller
spends);
*Cross every customer spend amount with every other spend amount
for that customer;
*NOTE: unlike the training query above, this count is unweighted
("then 1" in place of the triangle weight);
proc sql;
create table prob_cnt as
select
main.customer_id
,main.visit_date
,main.visit_spend
,sum(case
when sec.visit_spend ge main.visit_spend and
sec.visit_spend - 20 le main.visit_spend
then 1
else 0
end) as prob_cnt
from test_set_mod as main
inner join test_set_mod as sec on
main.customer_id = sec.customer_id
group by main.customer_id
,main.visit_date
,main.visit_spend
;
quit;
run;
*Find the maximum count of spends by customer;
proc summary nway missing data=prob_cnt;
class customer_id;
var prob_cnt;
output out=max_prob (drop=_TYPE_) max=max_prob;
run;
*Output the modal range for each customer;
data drive.test_modal_range
(drop = visit_date visit_spend prob_cnt output_flag
rename=_FREQ_=total_visits);
merge prob_cnt
max_prob;
by customer_id;
retain output_flag;
*Reset Output Flag;
if first.customer_id then output_flag = 0;
*Output the first spend amount whose count equals the maximum;
if prob_cnt = max_prob and output_flag = 0 then do;
output_flag = 1;
*The bottom value of the range was used in developing
the maximum range counts, so the middle of the range
is +10;
modal_range = visit_spend + 10;
output;
end;
run;
***********************************************************;
***********Modal range spend by Day of Week****************;
***********************************************************;
*Same idea as above, only this time each customer has a modal range of spend
amounts for Sunday through Saturday.;
*Cross every customer spend amount with every other spend amount
for that customer for a particular day of the week.;
*Use triangle weighting to give more recent spend dates more
weight;
proc sql;
create table prob_daily_cnt as
select
main.customer_id
,main.day
,main.visit_date
,main.visit_spend
,sum(case
when sec.visit_spend ge main.visit_spend and
sec.visit_spend - 20 le main.visit_spend
then (sec.visit_date - '01APR2010'd)
else 0
end) as prob_cnt
from test_set_mod as main
inner join test_set_mod as sec on
main.customer_id = sec.customer_id and
main.day = sec.day
group by main.customer_id
,main.day
,main.visit_date
,main.visit_spend
;
quit;
run;
*Find the maximum count of spends by customer and day of the
week;
proc summary nway missing data=prob_daily_cnt;
class customer_id day;
var prob_cnt;
output out=max_prob (drop=_TYPE_) max=max_prob;
run;
*Output the modal range for each customer;
data modal_range_wkday
(drop = visit_date visit_spend prob_cnt output_flag);
merge prob_daily_cnt
max_prob;
by customer_id day;
retain output_flag;
*Reset Output Flag;
if first.day then output_flag = 0;
*Output the first spend amount whose count equals the maximum;
if prob_cnt = max_prob and output_flag = 0 then do;
output_flag = 1;
*The bottom value of the range was used in developing
the maximum range counts, so the middle of the range
is +10;
modal_range = visit_spend + 10;
output;
end;
run;
*Transpose the day of week variable into columns;
proc transpose data=modal_range_wkday
out=drive.test_modal_range_wkday
label=day
prefix=Modal_range_;
by customer_id;
id day;
var modal_range;
run;
********************************************************;
**********Number of Visits during entire year***********;
********************************************************;
proc summary nway missing data=test_set_mod;
class customer_id day;
output out=drive.test_entire_visits (drop=_TYPE_
rename=_FREQ_=visit_num_entire);
run;
proc transpose data=drive.test_entire_visits
out=drive.test_entire_visit_wkday
label=day
prefix= visit_num_entire_;
by customer_id;
id day;
var visit_num_entire;
run;
*Find the answers;
data drive.test_forecast_spend (drop=_NAME_);
merge drive.test_last_spend
drive.test_modal_range
drive.test_modal_range_wkday
drive.test_entire_visit_wkday;
by customer_id;
run;
# *********************************************************;
# *                                                       *;
# *   Kaggle - Dunnhumby's Shoppers Challenge             *;
# *                                                       *;
# *   Written by Neil Schneider                           *;
# *                                                       *;
# *   Program uses loaded GBM results for each modeled    *;
# *   day to create scores used in JMP.                   *;
# *                                                       *;
# *********************************************************;
#Load R Workspace
load("C:/Dunnhumby/Winning Submission/GBM_results_R_workspace.RData")
#Load Libraries
#install.packages("gbm")
library("gbm")
#Load training and test date sets created in SAS
train_date_master <- as.matrix(read.csv("C:/Dunnhumby/Winning
Submission/train_master_set_R.csv"))
test_date_master <- as.matrix(read.csv("C:/Dunnhumby/Winning Submission/test_master_set_R.csv"))
train_date_results <- as.matrix(read.csv("C:/Dunnhumby/Winning Submission/train_date_results.csv"))
#Set Nulls and NA to 0
#Turn data into numerics. The Date field will become NAs but that is ok.
train_date_master[is.na(train_date_master)]<-0
train_date_master <- apply(train_date_master, 2, as.numeric)
test_date_master[is.na(test_date_master)]<-0
test_date_master <- apply(test_date_master, 2, as.numeric)
#Modify results and combine with master set
train_date_results[train_date_results==-1]<-0
train_date_results <- apply(train_date_results, 2, as.numeric)
train_date_master <- cbind(train_date_master[,1:110],train_date_results[,3:19])
colnames(train_date_master)[111:127] <- colnames(train_date_results[,3:19])
#Sample GBM code
#Models were run for each day up to Apr09. That cutoff was chosen based on the time available
#before the end of the contest and because the number of actual returns drops to low levels after the 9th.
#All GBM code can be recreated from the loaded workspace with the GBM class variables
# gbm1 <- gbm(apr_01~.,data=as.data.frame(train_date_master[1:99952,c(3:108,111)]),
#             #offset = NULL,
#             #misc = NULL,
#             distribution = "bernoulli",
#             #w = NULL,
#             #var.monotone = NULL,
#             n.trees = 50000,
#             interaction.depth = 3,
#             n.minobsinnode = 10,
#             shrinkage = 0.001,
#             bag.fraction = 0.5,
#             train.fraction = 1.0,
#             cv.folds = 5,
#             keep.data = FALSE,
#             verbose = TRUE,
#             #var.names = NULL,
#             #response.name = NULL
#             )
#cross-validation method is used to get the best iteration number
#predictions are made for both the training and test sets
best.iter.1 <- gbm.perf(gbm1,method="cv")
gbm.predict.1 <as.matrix(predict.gbm(gbm1,as.data.frame(train_date_master[1:99952,c(3:108,111)]),best.iter.1))
gbm.predict.test.1 <as.matrix(predict.gbm(gbm1,as.data.frame(test_date_master[1:10000,3:108]),best.iter.1))
best.iter.2 <- gbm.perf(gbm2,method="cv")
gbm.predict.2 <as.matrix(predict.gbm(gbm2,as.data.frame(train_date_master[1:99952,c(3:108,112)]),best.iter.2))
gbm.predict.test.2 <as.matrix(predict.gbm(gbm2,as.data.frame(test_date_master[1:10000,3:108]),best.iter.2))
best.iter.3 <- gbm.perf(gbm3,method="cv")
gbm.predict.3 <as.matrix(predict.gbm(gbm3,as.data.frame(train_date_master[1:99952,c(3:108,113)]),best.iter.3))
gbm.predict.test.3 <as.matrix(predict.gbm(gbm3,as.data.frame(test_date_master[1:10000,3:108]),best.iter.3))
best.iter.4 <- gbm.perf(gbm4,method="cv")
gbm.predict.4 <as.matrix(predict.gbm(gbm4,as.data.frame(train_date_master[1:99952,c(3:108,114)]),best.iter.4))
gbm.predict.test.4 <as.matrix(predict.gbm(gbm4,as.data.frame(test_date_master[1:10000,3:108]),best.iter.4))
best.iter.5 <- gbm.perf(gbm5,method="cv")
gbm.predict.5 <as.matrix(predict.gbm(gbm5,as.data.frame(train_date_master[1:99952,c(3:108,115)]),best.iter.5))
gbm.predict.test.5 <as.matrix(predict.gbm(gbm5,as.data.frame(test_date_master[1:10000,3:108]),best.iter.5))
best.iter.6 <- gbm.perf(gbm6,method="cv")
gbm.predict.6 <as.matrix(predict.gbm(gbm6,as.data.frame(train_date_master[1:99952,c(3:108,116)]),best.iter.6))
gbm.predict.test.6 <as.matrix(predict.gbm(gbm6,as.data.frame(test_date_master[1:10000,3:108]),best.iter.6))
best.iter.7 <- gbm.perf(gbm7,method="cv")
gbm.predict.7 <as.matrix(predict.gbm(gbm7,as.data.frame(train_date_master[1:99952,c(3:108,117)]),best.iter.7))
gbm.predict.test.7 <as.matrix(predict.gbm(gbm7,as.data.frame(test_date_master[1:10000,3:108]),best.iter.7))
best.iter.8 <- gbm.perf(gbm8,method="cv")
gbm.predict.8 <as.matrix(predict.gbm(gbm8,as.data.frame(train_date_master[1:99952,c(3:108,118)]),best.iter.8))
gbm.predict.test.8 <as.matrix(predict.gbm(gbm8,as.data.frame(test_date_master[1:10000,3:108]),best.iter.8))
best.iter.9 <- gbm.perf(gbm9,method="cv")
gbm.predict.9 <as.matrix(predict.gbm(gbm9,as.data.frame(train_date_master[1:99952,c(3:108,119)]),best.iter.9))
gbm.predict.test.9 <as.matrix(predict.gbm(gbm9,as.data.frame(test_date_master[1:10000,3:108]),best.iter.9))
#create individual vectors for each date
apr01 <- (train_date_master[1:99952,111])
apr02 <- (train_date_master[1:99952,112])
apr03 <- (train_date_master[1:99952,113])
apr04 <- (train_date_master[1:99952,114])
apr05 <- (train_date_master[1:99952,115])
apr06 <- (train_date_master[1:99952,116])
apr07 <- (train_date_master[1:99952,117])
apr08 <- (train_date_master[1:99952,118])
apr09 <- (train_date_master[1:99952,119])
#combine training and test predictions with actual date flags
gbm.predict.train <- cbind(train_date_master[,1],gbm.predict.1,gbm.predict.2,gbm.predict.3,gbm.predict.4,gbm.predict.5,
                           gbm.predict.6,gbm.predict.7,gbm.predict.8,gbm.predict.9,
                           apr01,apr02,apr03,apr04,apr05,apr06,apr07,apr08,apr09)
#Apr date flags from the first 10000 records in the training set are used as placeholders for the variables
gbm.predict.test <- cbind(test_date_master[,1],gbm.predict.test.1,gbm.predict.test.2,gbm.predict.test.3,gbm.predict.test.4,
                          gbm.predict.test.5,gbm.predict.test.6,gbm.predict.test.7,gbm.predict.test.8,gbm.predict.test.9,
                          apr01[1:10000],apr02[1:10000],apr03[1:10000],apr04[1:10000],apr05[1:10000],
                          apr06[1:10000],apr07[1:10000],apr08[1:10000],apr09[1:10000])
#The wrongly mapped date flag columns are zeroed out
gbm.predict.test[,11:18] <- 0
#Stack the sets and output them into a csv for JMP.
gbm.predict.all <- rbind(gbm.predict.train,gbm.predict.test)
write.csv(gbm.predict.all,"C:/Dunnhumby/Winning Submission/Apr01_Apr09_JMP.csv")
*******************************************************;
*                                                     *;
*   Kaggle - Dunnhumby's Shoppers Challenge           *;
*                                                     *;
*   Written by Neil Schneider                         *;
*                                                     *;
*   Program combines date and spend projections to    *;
*   create the final submission.                      *;
*                                                     *;
*******************************************************;
*Bring in results from R;
PROC IMPORT OUT= drive.Results_from_R
DATAFILE=
"C:\Dunnhumby\Winning Submission\Apr01_Apr09_JMP.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
GUESSINGROWS=50;
RUN;
data max_score_date;
set drive.results_from_R (drop= VAR1 rename=(VAR2=customer_id));
max_score = max(var3,var4,var5,var6,var7,var8,var9,var10,var11);
format forecast_date yymmdd10.;
select (max_score);
when (var3) forecast_date = '01APR2011'd;
when (var4) forecast_date = '02APR2011'd;
when (var5) forecast_date = '03APR2011'd;
when (var6) forecast_date = '04APR2011'd;
when (var7) forecast_date = '05APR2011'd;
when (var8) forecast_date = '06APR2011'd;
when (var9) forecast_date = '07APR2011'd;
when (var10) forecast_date = '08APR2011'd;
when (var11) forecast_date = '09APR2011'd;
otherwise forecast_date = "Error";
end;
run;
*Combine the spend projections with the GBM max-score dates;
Data drive.final_submission (keep= customer_id visit_date visit_spend);
merge drive.test_forecast_spend
max_score_date
(keep=customer_id forecast_date firstobs=99953);
by customer_id;
format visit_date mmddyy10.;
visit_date = forecast_date;
select (weekday(visit_date));
when (1) do;
if modal_range_1 = . or
visit_num_entire_1 lt 15 then
visit_spend = modal_range;
else visit_spend = modal_range_1;
end;
when (2) do;
if modal_range_2 = . or
visit_num_entire_2 lt 17 then
visit_spend = modal_range;
else visit_spend = modal_range_2;
end;
when (3) do;
if modal_range_3 = . or
visit_num_entire_3 lt 22 then
visit_spend = modal_range;
else visit_spend = modal_range_3;
end;
when (4) do;
if modal_range_4 = . or
visit_num_entire_4 lt 19 then
visit_spend = modal_range;
else visit_spend = modal_range_4;
end;
when (5) do;
if modal_range_5 = . or
visit_num_entire_5 lt 16 then
visit_spend = modal_range;
else visit_spend = modal_range_5;
end;
when (6) do;
if modal_range_6 = . or
visit_num_entire_6 lt 26 then
visit_spend = modal_range;
else visit_spend = modal_range_6;
end;
when (7) do;
if modal_range_7 = . or
visit_num_entire_7 lt 21 then
visit_spend = modal_range;
else visit_spend = modal_range_7;
end;
Otherwise visit_spend = modal_range;
end;
run;
*Output the final submission CSV;
proc export data = drive.final_submission
outfile =
'C:\Dunnhumby\Winning Submission\winning_submission.csv'
DBMS = csv replace;
run;