
Using R
to win
Kaggle Data Mining
Competitions
Chris Raimondi
November 1, 2012
Overview of talk
• What I hope you get out of this talk
• Life before R
• Simple model example
• R programming language
• Background/Stats/Info
• How to get started
• Kaggle
Overview of talk
• Individual Kaggle competitions
• HIV Progression
• Chess
• Mapping Dark Matter
• Dunnhumby’s Shoppers Challenge
• Online Product Sales
What I want you to leave with
• Belief that you don’t need to be a
statistician to use R - NOR do you
need to fully understand Machine
Learning in order to use it
• Motivation to use Kaggle
competitions to learn R
• Knowledge on how to start
My life before R
• Lots of Excel
• Had tried programming in the past –
got frustrated
• Read NY Times article in January
2009 about R & Google
• Installed R, but gave up after a
couple minutes
• Months later…
My life before R
• Using Excel to run PageRank
calculations that took hours and was
very messy
• Was experimenting with Pajek – a
windows based Network/Link
analysis program
• Was looking for a similar program
that did PageRank calculations
• Revisited R as a possibility
My life before R
• Came across “R Graph Gallery”
• Saw this graph…
Addicted to R in one line of code
pairs(iris[1:4], main="Edgar Anderson's Iris Data", pch=21,
bg=c("red", "green3", "blue")[unclass(iris$Species)])
“pairs” = function
“iris” = dataframe
What do we want to do
with R?
• Machine learning
a.k.a. – or more specifically
• Making models
We want to TRAIN a set of data with
KNOWN answers/outcomes
In order to PREDICT the
answer/outcome to similar data where
the answer is not known
How to train a model
R allows you to train models using probably
more than 100 different machine learning methods
To train a model you need to provide
1. Name of the function – which machine learning
method
2. Name of Dataset
3. What is your response variable and what
features are you going to use
Example machine learning methods
available in R
Bagging
Boosted Trees
Elastic Net
Gaussian Processes
Generalized additive model
Generalized linear model
K Nearest Neighbor
Linear Regression
Nearest Shrunken Centroids
Neural Networks
Partial Least Squares
Principal Component Regression
Projection Pursuit Regression
Quadratic Discriminant Analysis
Random Forests
Recursive Partitioning
Rule-Based Models
Self-Organizing Maps
Sparse Linear Discriminant Analysis
Support Vector Machines
Code used to train decision tree
library(party)
irisct <- ctree(Species ~ Sepal.Length +
Sepal.Width + Petal.Length + Petal.Width,
data = iris)
Or use “.” to mean everything else - as in…
irisct <- ctree(Species ~ ., data = iris)
That’s it
You’ve trained your model – to make predictions with it
– use the “predict” function – like so:
my.prediction <- predict(irisct, iris2)
To see a graphic representation of it – use “plot”.
plot(irisct)
plot(irisct, tp_args = list(fill =
c("red", "green3", "blue")))
R background
• Statistical Programming Language
• Since 1996
• Powerful – used by companies like
Google, Allstate, and Pfizer.
• Over 4,000 packages available on
CRAN
• Free
• Available for Linux, Mac, and
Windows
Learn R – Starting Tonight
• Buy “R in a Nutshell”
• Download and install R
• Download and install RStudio
• Watch the 2.5-minute video on the
front page of rstudio.com
• Use read.csv to read a Kaggle
data set into R
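That last step is one line of code. A minimal sketch — since the real file name depends on the competition, this fakes the Kaggle download by writing iris to disk first:

```r
# Stand-in for a Kaggle download: a competition's training file
# is just a CSV, so we fake one by writing iris to disk.
write.csv(iris, "train.csv", row.names = FALSE)

# This is the one line you need for a real Kaggle file:
train <- read.csv("train.csv")

# First things to run on any new data set:
dim(train)      # rows and columns
str(train)      # type of each column
head(train)     # first six rows
```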
Learn R – Continue
Tomorrow
• Train a model using Kaggle data
• Make a prediction using that
model
• Submit the prediction to Kaggle
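Those three steps could look like the sketch below, using ctree from the party package (as in the earlier example) and a hold-out split of iris in place of real Kaggle train/test files. The submission column names here are made up — use whatever the competition page asks for:

```r
library(party)

# Stand-in for Kaggle's train/test files: fit on 100 iris rows,
# predict the held-out 50.
set.seed(1)
in.train <- sample(nrow(iris), 100)
train <- iris[in.train, ]
test  <- iris[-in.train, ]

fit  <- ctree(Species ~ ., data = train)
pred <- predict(fit, newdata = test)

# Submissions are usually a CSV with an id and a prediction;
# the column names here are hypothetical.
submission <- data.frame(Id = rownames(test), Prediction = pred)
write.csv(submission, "submission.csv", row.names = FALSE)
```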
Learn R – This Weekend
• Install the Caret package
• Start reading the four Caret
vignettes
• Use the “train” function in Caret
to train a model, select a
parameter, and make a
prediction with this model
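A sketch of that weekend plan in code — method and tuneLength are standard train() arguments, though which parameter gets tuned (here k for knn) depends on the method you pick:

```r
install.packages("caret")        # one-time setup
library(caret)

vignette(package = "caret")      # lists the caret vignettes

# train() fits the model AND picks the tuning parameter
# (k for knn) by resampling:
fit <- train(Species ~ ., data = iris,
             method = "knn", tuneLength = 3)
predict(fit, head(iris))
```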
Buy This Book: R in a
Nutshell
• Excellent Reference
• 2nd Edition released
just two weeks ago
• In stock at Amazon
for $37.05
• Extensive chapter on
machine learning
R Studio
R Tip
Read the vignettes – some of them are golden.
There is a correlation between the quality of an
R package and its associated vignette.
What is kaggle?
• Platform/website for predictive
modeling competitions
• Think middleman – they provide
the tools for anyone to host a data
mining competition
• Makes it easy for competitors as
well – they know where to go to
find the data/competitions
• Community/forum to find
teammates
Kaggle Stats
• Competitions started over 2 years
ago
• 55+ different competitions
• Over 60,000 Competitors
• 165,000+ Entries
• Over $500,000 in prizes awarded
Why Use Kaggle?
• Rich, diverse set of competitions
• Real-world data
• Competition = Motivation
• Fame
• Fortune
Who has Hosted on Kaggle?
Methods used by competitors
(source: kaggle.com)
Predict HIV Progression
Prizes:
1st $500.00
Objective:
Predict (yes/no) if there will be an
improvement in a patient's HIV viral load.
Training Data:
1,000 Patients
Testing Data:
692 Patients
[Data slide: each patient is one row with a Response (the answer), two sequence features (PR Seq, RT Seq), and two numeric features (VL-t0, CD4-t0). Training rows have a known Response; test rows show N/A.]

Training set:
Response  PR Seq      RT Seq      VL-t0  CD4-t0
1         CCTCAGATCA  TACCTTAAAT  4.7    473
1         CACTCTAAAT  CTTAAATTTY  5.0    7
0         AAGAAATCTG  CCTCAGATCA  3.2    349
0         AAGAAATCTG  CTCTTTGGCA  5.1    51
0         AAGAAATCTG  GAGAGATCTG  3.7    77
0         CACTCTAAAT  CTTAAATTTY  5.7    206
0         AAGAAATCTG  TCTAAATTTC  3.9    144
0         CACTTTAAAT  TCTAAACTTT  4.4    496
0         AAGAAATCTG  CTCTTTGGCA  3.4    252
1         TGGAAGAAAT  CTCTTTGGCA  5.5    7
1         TTCGTCACAA  CTCTTTGGCA  4.3    109
0         AAGAGATCTG  CTCTTTGGCA  5.0    70
0         ACTAAATTTT  CTCTTTGGCA  5.0    570
0         CCTCAAATCA  CTCTTTGGCA  4.0    217

Test set (Response withheld from competitors; the hidden answer is shown in parentheses):
Response  PR Seq      RT Seq      VL-t0  CD4-t0
N/A (1)   CCTCAGATCA  TCTAAATTTC  2.8    730
N/A (0)   ATTAAATTTT  CTCTTTGGCA  4.5    56
N/A (0)   ATTAAATTTT  TACTTTAAAT  5.1    21
N/A (1)   CCTCAGATCA  CTCTTTGGCA  5.5    249
N/A (0)   CCTCAAATCA  CTTAAATTTT  4.0    269
N/A (1)   AAGGAATCTG  CCTCAGATCA  4.6    165
N/A (0)   AAGAAATCTG  TCTAAATTTC  3.9    144
N/A (0)   CACTTTAAAT  TCTAAACTTT  4.4    496
N/A (0)   AAGAAATCTG  CTCTTTGGCA  3.4    252
N/A (1)   TGGAAGAAAT  CTCTTTGGCA  5.5    91

The test rows are further divided into the public leaderboard and the private leaderboard.
Predict HIV Progression
Features Provided:
1. PR: 297 letters long – or N/A
2. RT: 193–494 letters long
3. CD4: Numeric
4. VLt0: Numeric
Features Used:
1. PR1–PR97: Factor
2. RT1–RT435: Factor
3. CD4: Numeric
4. VLt0: Numeric
Predict HIV Progression
Concepts / Packages:
• Caret
• train
• rfe
• randomForest
Random Forest
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
5.1           3.5          1.4           0.2
4.9           3.0          1.4           0.2
4.7           3.2          1.3           0.2
4.6           3.1          1.5           0.2
5.0           3.6          1.4           0.2
5.4           3.9          1.7           0.4
4.6           3.4          1.4           0.3
5.0           3.4          1.5           0.2
4.4           2.9          1.4           0.2
4.9           3.1          1.5           0.1
5.4           3.7          1.5           0.2
4.8           3.4          1.6           0.2
4.8           3.0          1.4           0.1
4.3           3.0          1.1           0.1
5.8           4.0          1.2           0.2
5.7           4.4          1.5           0.4
5.4           3.9          1.3           0.4
5.1           3.5          1.4           0.3
5.7           3.8          1.7           0.3
5.1           3.8          1.5           0.3
Tree 1:
Take a bootstrap sample of rows from the
data set (~63.2% of the unique rows appear
in each sample)
At each node, consider mtry randomly chosen
features – in this case 2 would be the default
Tree 2:
Take a different bootstrap sample of rows
from the data set
And so on…
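This procedure is what the randomForest package automates. A minimal sketch on iris — the default mtry for 4 predictors is floor(sqrt(4)) = 2, matching the slide:

```r
library(randomForest)

set.seed(42)
# 500 trees, each grown on its own bootstrap sample of rows
# (~63.2% of unique rows per sample); mtry = 2 features are
# tried at each split.
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

rf              # confusion matrix and out-of-bag error estimate
importance(rf)  # which features the forest relied on
```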
Caret – train
library(caret)

TrainData <- iris[, 1:4]
TrainClasses <- iris[, 5]

knnFit1 <- train(TrainData, TrainClasses,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 3,
                 trControl = trainControl(method = "cv",
                                          number = 10))
Caret – train
> knnFit1
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
Pre-processing: centered, scaled
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 135, 135, 135, 135,
135, 135, ...
Resampling results across tuning parameters:
Caret – train
k   Accuracy  Kappa  Accuracy SD  Kappa SD
5   0.94      0.91   0.0663       0.0994
7   0.967     0.95   0.0648       0.0972
9   0.953     0.93   0.0632       0.0949
11  0.953     0.93   0.0632       0.0949
13  0.967     0.95   0.0648       0.0972
15  0.967     0.95   0.0648       0.0972
17  0.973     0.96   0.0644       0.0966
19  0.96      0.94   0.0644       0.0966
21  0.96      0.94   0.0644       0.0966
23  0.947     0.92   0.0613       0.0919

Accuracy was used to select the optimal model
using the largest value.
The final value used for the model was k = 17.
Benefits of winning
• Cold hard cash
• Several newspaper articles
• Quoted in Science magazine
• Prestige
• Easier to find people willing to
team up
• Asked to speak at STScI
• Perverse pleasure in telling
people the team that came in
second worked at….
IBM Thomas J. Watson
Research Center
Chess Ratings Comp
Prizes:
1st $10,000.00
Objective:
Given 100 months of data predict game
outcomes for months 101 – 105.
Training Data Provided:
1. Month
2. White Player #
3. Black Player #
4. White Outcome – Win/Draw/Lose
(1/0.5/0)
How do I convert the data
into a flat 2D
representation?
Think:
1. What are you trying to
predict?
2. What Features will you
use?
[Diagram: each game becomes one row of a flat 2D matrix.
Columns: White Features 1–4, Black Features 1–4, Game Features 1–2,
and the Outcome (1 / 0.5 / 0).
Example features: Number of Games Played, Percentage of Games Won,
Number of Games Won as White, White Games Played / Black Games
Played, Type of Game Played.]
Packages/Concepts Used:
1. igraph
2. 1st real function
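A toy sketch of the feature-building step — the games data frame below is invented (the real file has Month, White Player #, Black Player #, White Outcome), but the aggregation idea is the same:

```r
# Invented miniature of the chess data: one row per game,
# Score is the White player's result (1 / 0.5 / 0).
games <- data.frame(
  White = c(1, 1, 2, 3, 2),
  Black = c(2, 3, 3, 1, 1),
  Score = c(1, 0.5, 0, 1, 0.5)
)

# Per-player features as White: games played and win percentage
white.played <- table(games$White)
white.points <- tapply(games$Score, games$White, sum)
white.winpct <- white.points / white.played
white.winpct   # player 1: 0.75, player 2: 0.25, player 3: 1
```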
Mapping Dark Matter
Prizes:
1st ~$3,000.00
The prize will be an expenses paid trip to the Jet
Propulsion Laboratory (JPL) in Pasadena, California
to attend the GREAT10 challenge workshop "Image
Analysis for Cosmology".
Objective:
“Participants are provided with 100,000
galaxy and star pairs. A participant should
provide an estimate for the ellipticity for
each galaxy.”
dunnhumby's Shopper
Challenge
Prizes:
1st $6,000.00
2nd $3,000.00
3rd $1,000.00
Objective:
• Predict the next date that the
customer will make a purchase
AND
• Predict the amount of the
purchase to within £10.00
Data Provided
For 100,000 customers:
April 1, 2010 – June 19, 2011
1. customer_id
2. visit_date
3. visit_spend
For 10,000 customers:
April 1, 2010 – March 31, 2011
1. customer_id
2. visit_date
3. visit_spend
Really two different challenges:
1) Predict the next purchase date –
max of ~42.73% obtained
2) Predict the purchase amount to
within £10.00 –
max of ~38.99% obtained
If independent: 42.73% × 38.99% =
16.66%
In reality, the max obtained was
18.83%
dunnhumby's Shopper
Challenge
Packages Used &
Concepts Explored:
1st competition with real dates
• zoo
• arima
• forecast
SVD
• svd
• irlba
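On the time-series side, forecast's auto.arima picks an ARIMA model automatically — a sketch on an invented spend series, not the actual competition code:

```r
library(forecast)

# Invented stand-in for one customer's spend per period
spend <- ts(c(52, 48, 60, 55, 49, 58, 62, 57, 51, 59, 64, 60),
            frequency = 4)

fit <- auto.arima(spend)   # picks (p, d, q) automatically
forecast(fit, h = 4)       # point forecasts + intervals, 4 periods ahead
```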
SVD
Singular value decomposition
[Diagram: an 807 × 1209 original matrix factored as U × D × Vᵀ.
D is an N × N diagonal matrix whose entries – the singular values –
are ordered from 1st most important to Nth most important. The
columns of U are row features and the columns of V are column
features, in the same order of importance. Keeping only the 1st
singular value gives a rank-one approximation of the original
matrix.]
library(ReadImages)  # provides read.jpeg and imagematrix

x <- read.jpeg("test.image.2.jpg")
im <- imagematrix(x, type = "grey")
im.svd <- svd(im)
u <- im.svd$u
d <- diag(im.svd$d)
v <- im.svd$v
new.u <- as.matrix(u[, 1:1])
new.d <- as.matrix(d[1:1, 1:1])
new.v <- as.matrix(v[, 1:1])
new.mat <- new.u %*% new.d %*% t(new.v)
new.im <- imagematrix(new.mat, type = "grey")
plot(new.im, useRaster = TRUE)
[Series of slides: the image rebuilt from the top 1, 2, 3, 4, 5,
and 6 singular values, and finally from all 807 – each added
component restores more detail until the reconstruction matches
the original 807 × 1209 matrix.]
[Diagram: the 100,000 × 365 customer-by-day spend matrix factored
as U × D × Vᵀ. The columns of U are customer features, the columns
of V are day features, and D (365 × 365) orders them from 1st to
Nth most important.]
[Plots: U[,1] is a 100,000 × 1 vector – the most important customer
feature. The 1st through 8th columns of V (each 365 × 1, one value
per day) are plotted as day features, mostly over the first 28 days,
with the last shown over all 365 days.]
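The customer-by-day decomposition can be reproduced in miniature with base R's svd — a toy 6 × 7 matrix standing in for the 100,000 × 365 one:

```r
set.seed(1)
# Toy customer-by-day spend matrix: 6 customers, 7 days
mat <- matrix(rpois(42, lambda = 20), nrow = 6)

s <- svd(mat)
# s$u: customer features, s$v: day features,
# s$d: singular values (importance)
day.feature.1 <- s$v[, 1]   # the most important day feature

# Singular values come back sorted, most important first:
s$d
```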
Online Product Sales
Prizes:
1st $15,000.00
2nd $ 5,000.00
3rd $ 2,500.00
Objective:
“[P]redict monthly online sales of a
product. Imagine the products are
online self-help programs following an
initial advertising campaign.”
Online Product Sales
Packages/Concepts Explored:
1. Data analysis – looking at
data closely
2. gbm
3. Teams
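A sketch of gbm on invented regression data — the real competition's features are different, but the call shape is the same:

```r
library(gbm)

set.seed(7)
# Invented stand-in for the sales data: y depends mostly on x1
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- 3 * df$x1 + rnorm(200, sd = 0.1)

fit <- gbm(y ~ x1 + x2, data = df, distribution = "gaussian",
           n.trees = 500, interaction.depth = 3, shrinkage = 0.05)

summary(fit, plotit = FALSE)         # relative influence: x1 should dominate
head(predict(fit, df, n.trees = 500))
```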
Online Product Sales
Looking at data closely
... 6532 6532 6661 6661 7696 7701 7701
8229 8412 8895 9596 9596 9772 9772 ...
        Cat_1=0  Cat_1=1
6274    1        1
6532    1        1
6661    1        1
7696    0        1
7701    1        1
8229    1        0
8412    1        0
8895    1        0
9596    1        1
9772    1        1
Online Product Sales
On the public leaderboard:
Online Product Sales
On the private leaderboard:
Thank You!
Questions?
Extra Slides
R Code for Dunnhumby
Time Series
[Diagram: the 150 × 4 iris matrix factored as
U (150 × 4) × D (4 × 4) × Vᵀ (4 × 4).]
> my.svd <- svd(iris[,1:4])
> objects(my.svd)
[1] "d" "u" "v"
> my.svd$d
[1] 95.959914 17.761034  3.460931  1.884826
> dim(my.svd$u)
[1] 150 4
> dim(my.svd$v)
[1] 4 4
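To close the loop, the reconstruction works the same way as with the image — base R only:

```r
# Rebuild the 150 x 4 iris matrix from its SVD
x <- as.matrix(iris[, 1:4])
s <- svd(x)

full <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(full - x))    # effectively 0: exact reconstruction

# Rank-1 approximation using only the largest singular value
rank1 <- s$d[1] * (s$u[, 1] %*% t(s$v[, 1]))
dim(rank1)            # still 150 x 4
```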