Using R to win Kaggle Data Mining Competitions
Chris Raimondi
November 1, 2012

Overview of talk
• What I hope you get out of this talk
• Life before R
• Simple model example
• R programming language
  • Background/stats/info
  • How to get started
• Kaggle

Overview of talk
• Individual Kaggle competitions
  • HIV Progression
  • Chess
  • Mapping Dark Matter
  • dunnhumby's Shopper Challenge
  • Online Product Sales

What I want you to leave with
• The belief that you don't need to be a statistician to use R - nor do you need to fully understand machine learning in order to use it
• The motivation to use Kaggle competitions to learn R
• The knowledge of how to start

My life before R
• Lots of Excel
• Had tried programming in the past - got frustrated
• Read the NY Times article in January 2009 about R and Google
• Installed R, but gave up after a couple of minutes
• Months later…

My life before R
• Was using Excel to run PageRank calculations that took hours and were very messy
• Was experimenting with Pajek - a Windows-based network/link analysis program
• Was looking for a similar program that could do PageRank calculations
• Revisited R as a possibility

My life before R
• Came across the "R Graph Gallery"
• Saw this graph…

Addicted to R in one line of code

pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21,
      bg = c("red", "green3", "blue")[unclass(iris$Species)])

"pairs" = function
"iris" = data frame

What do we want to do with R?
• Machine learning - or, more specifically, making models
• We want to TRAIN a model on a set of data with KNOWN answers/outcomes
• In order to PREDICT the answer/outcome for similar data where the answer is not known

How to train a model
R allows you to train models using probably over 100 different machine learning methods. To train a model you need to provide:
1. The name of the function - i.e. which machine learning method to use
2. The name of the dataset
3. Your response variable and the features you are going to use

Example machine learning methods available in R
• Bagging
• Boosted Trees
• Elastic Net
• Gaussian Processes
• Generalized Additive Models
• Generalized Linear Models
• K Nearest Neighbors
• Linear Regression
• Nearest Shrunken Centroids
• Neural Networks
• Partial Least Squares
• Principal Component Regression
• Projection Pursuit Regression
• Quadratic Discriminant Analysis
• Random Forests
• Recursive Partitioning
• Rule-Based Models
• Self-Organizing Maps
• Sparse Linear Discriminant Analysis
• Support Vector Machines

Code used to train a decision tree

library(party)
irisct <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris)

Or use "." to mean "everything else" - as in:

irisct <- ctree(Species ~ ., data = iris)

That's it
You've trained your model. To make predictions with it, use the "predict" function - like so:

my.prediction <- predict(irisct, iris2)

To see a graphic representation of it, use "plot":

plot(irisct)
plot(irisct, tp_args = list(fill = c("red", "green3", "blue")))
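A minimal sanity check (not on the original slides - and note that "iris2" above stands for whatever new data you want to score): predicting back on the training data and cross-tabulating against the known species shows how well the tree fits.

# Confusion table of fitted vs. actual species on the training data
table(predicted = predict(irisct, newdata = iris), actual = iris$Species)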
R background
• Statistical programming language
• Around since 1996
• Powerful - used by companies like Google, Allstate, and Pfizer
• Over 4,000 packages available on CRAN
• Free
• Available for Linux, Mac, and Windows

Learn R - starting tonight
• Buy "R in a Nutshell"
• Download and install R
• Download and install RStudio
• Watch the 2.5-minute video on the front page of rstudio.com
• Use read.csv to read a Kaggle data set into R

Learn R - continue tomorrow
• Train a model using Kaggle data
• Make a prediction using that model
• Submit the prediction to Kaggle

Learn R - this weekend
• Install the caret package
• Start reading the four caret vignettes
• Use caret's "train" function to train a model, select a tuning parameter, and make a prediction with the model

Buy this book: R in a Nutshell
• Excellent reference
• 2nd edition released just two weeks ago
• In stock at Amazon for $37.05
• Extensive chapter on machine learning

RStudio
[screenshot omitted]

R tip
Read the vignettes - some of them are golden, and there is a correlation between the quality of an R package and that of its vignette.

What is Kaggle?
• A platform/website for predictive modeling competitions
• Think middleman - they provide the tools for anyone to host a data mining competition
• Makes it easy for competitors as well - they know where to go to find the data/competitions
• Community/forum for finding teammates

Kaggle stats
• Competitions started over 2 years ago
• 55+ different competitions
• Over 60,000 competitors
• 165,000+ entries
• Over $500,000 in prizes awarded

Why use Kaggle?
• Rich, diverse set of competitions
• Real-world data
• Competition = motivation
• Fame
• Fortune

Who has hosted on Kaggle?
[logos omitted]

Methods used by competitors
[chart omitted - source: kaggle.com]

Predict HIV Progression
Prizes: 1st $500.00
Objective: predict (yes/no) whether there will be an improvement in a patient's HIV viral load.
Training data: 1,000 patients
Testing data: 692 patients

[data sample omitted: each patient has a 0/1 response, a PR sequence, an RT sequence, and numeric VL-t0 and CD4-t0 values; responses are given for the training set and are N/A for the test set, which is split between the public and private leaderboards]

Predict HIV Progression
Features provided:
1. PR: 297 letters long - or N/A
2. RT: 193-494 letters long
3. CD4: numeric
4. VL-t0: numeric

Features used:
1. PR1-PR97: factor
2. RT1-RT435: factor
3. CD4: numeric
4. VL-t0: numeric

Predict HIV Progression
Concepts/packages:
• caret
• train
• rfe
• randomForest

Random Forest
[iris data sample omitted]
Tree 1: take a random sample of rows from the data set (a bootstrap sample, so each tree sees about 63.2% of the distinct rows). At each node, try mtry randomly chosen features - with four features, mtry = 2 would be the default here.
Tree 2: take a different random ~63.2% sample of rows from the data set.
And so on… (a short randomForest sketch follows)
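The slides stop at the intuition; as a minimal, hedged sketch (run on iris purely for illustration - this is not the competition code), the randomForest package does all of the above in one call:

library(randomForest)
set.seed(1)                       # reproducible forest
# mtry = 2 is the default for 4 predictors (floor(sqrt(4))); 500 trees
iris.rf <- randomForest(Species ~ ., data = iris, mtry = 2, ntree = 500)
print(iris.rf)                    # out-of-bag error estimate and confusion matrix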
Caret - train

library(caret)
TrainData <- iris[, 1:4]
TrainClasses <- iris[, 5]
knnFit1 <- train(TrainData, TrainClasses,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,  # tries k = 5, 7, ..., 23, matching the output below
                 trControl = trainControl(method = "cv", number = 10))

Caret - train

> knnFit1
150 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica'

Pre-processing: centered, scaled
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
Resampling results across tuning parameters:

  k   Accuracy  Kappa  Accuracy SD  Kappa SD
   5  0.94      0.91   0.0663       0.0994
   7  0.967     0.95   0.0648       0.0972
   9  0.953     0.93   0.0632       0.0949
  11  0.953     0.93   0.0632       0.0949
  13  0.967     0.95   0.0648       0.0972
  15  0.967     0.95   0.0648       0.0972
  17  0.973     0.96   0.0644       0.0966
  19  0.96      0.94   0.0644       0.0966
  21  0.96      0.94   0.0644       0.0966
  23  0.947     0.92   0.0613       0.0919

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 17.

Benefits of winning
• Cold hard cash
• Several newspaper articles
• Quoted in Science magazine
• Prestige
• Easier to find people willing to team up
• Asked to speak at STScI
• Perverse pleasure in telling people that the team that came in second worked at… IBM Thomas J. Watson Research Center

Chess Ratings Comp
Prizes: 1st $10,000.00
Objective: given 100 months of data, predict game outcomes for months 101-105.
Training data provided:
1. Month
2. White player #
3. Black player #
4. White outcome - win/draw/lose (1/0.5/0)

How do I convert the data into a flat 2D representation? Think:
1. What are you trying to predict?
2. What features will you use?

[table omitted: one row per game, with the outcome (1/0.5/0) as the response and derived columns such as number of games played, percentage of games won, number of games won as White, White games played/Black games played, and type of game played - computed for both the White and the Black player]

Packages/concepts used:
1. igraph (one hypothetical use is sketched below)
2. 1st real function
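The slides don't show the chess code itself, so purely as an illustration of how igraph could generate a player-strength feature (the toy data frame and the loser-to-winner PageRank idea below are my assumptions, not the deck's actual method):

library(igraph)
# Hypothetical toy results - the real data has player numbers and months
games <- data.frame(loser  = c("B", "C", "C", "A"),
                    winner = c("A", "B", "A", "C"))
# Edges point from loser to winner, so rank accumulates at strong players
g <- graph_from_data_frame(games, directed = TRUE)
page_rank(g)$vector   # one candidate per-player feature for the model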
Mapping Dark Matter
Prizes: 1st ~$3,000.00 - an expenses-paid trip to the Jet Propulsion Laboratory (JPL) in Pasadena, California to attend the GREAT10 challenge workshop "Image Analysis for Cosmology".
Objective: "Participants are provided with 100,000 galaxy and star pairs. A participant should provide an estimate for the ellipticity for each galaxy."

dunnhumby's Shopper Challenge
Prizes: 1st $6,000.00, 2nd $3,000.00, 3rd $1,000.00
Objective:
• Predict the next date on which the customer will make a purchase, AND
• Predict the amount of that purchase to within £10.00

Data provided:
For 100,000 customers (April 1, 2010 - June 19, 2011):
1. customer_id
2. visit_date
3. visit_spend
For 10,000 customers (April 1, 2010 - March 31, 2011):
1. customer_id
2. visit_date
3. visit_spend

Really two different challenges:
1. Predict the next purchase date - a max of ~42.73% was obtained
2. Predict the purchase amount to within £10.00 - a max of ~38.99% was obtained
If the two were independent: 42.73% * 38.99% = 16.66%
In reality, the max obtained was 18.83%

dunnhumby's Shopper Challenge
Packages used and concepts explored:
• 1st competition with real dates
• zoo
• arima
• forecast

SVD
• svd
• irlba

SVD - singular value decomposition
[diagram omitted: an 807 x 1209 image matrix X is factored as X = U D V^T, with the columns of U and V ordered from most to least important and D a diagonal matrix of singular values; the row features live in U and the column features in V]

# read.jpeg and imagematrix are from the (now archived) ReadImages package
x <- read.jpeg("test.image.2.jpg")
im <- imagematrix(x, type = "grey")
im.svd <- svd(im)
u <- im.svd$u
d <- diag(im.svd$d)
v <- im.svd$v

Keeping only the 1st (most important) component reconstructs a rank-1 approximation of the image:

new.u <- as.matrix(u[, 1:1])
new.d <- as.matrix(d[1:1, 1:1])
new.v <- as.matrix(v[, 1:1])
new.mat <- new.u %*% new.d %*% t(new.v)
new.im <- imagematrix(new.mat, type = "grey")
plot(new.im, useRaster = TRUE)

[image series omitted: the same reconstruction repeated with the 2, 3, 4, 5, and 6 most important components, and finally all 807, with the image sharpening as components are added]

The same idea applies to the shopper data:
[diagram omitted: the 100,000 x 365 customer-by-day spend matrix is factored as U D V^T, where U holds customer features and V (365 x 365) holds day features]

[plots omitted: U[,1], a 100,000 x 1 vector of customer features, and the first eight day-feature columns of V, each 365 x 1 (the first 28 days shown for most, all 365 days for the 8th)]

Online Product Sales
Prizes: 1st $15,000.00, 2nd $5,000.00, 3rd $2,500.00
Objective: "[P]redict monthly online sales of a product. Imagine the products are online self-help programs following an initial advertising campaign."

Online Product Sales
Packages/concepts explored:
1. Data analysis - looking at data closely
2. gbm (a minimal sketch follows the next slide)
3. Teams

Online Product Sales
Looking at data closely: sorting one column turned up repeated values (…, 6532, 6532, 6661, 6661, 7696, 7701, 7701, 8229, 8412, 8895, 9596, 9596, 9772, 9772, …), and cross-tabulating them against Cat_1 showed many values appearing once with Cat_1 = 0 and once with Cat_1 = 1:

        Cat_1=0  Cat_1=1
 6274      1        1
 6532      1        1
 6661      1        1
 7696      0        1
 7701      1        1
 8229      1        0
 8412      1        0
 8895      1        0
 9596      1        1
 9772      1        1
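The deck doesn't include the gbm code; as a hedged sketch of the kind of boosted-tree model the package fits (train.df, test.df, and the Monthly_Sales column are hypothetical stand-ins, not the competition's actual names):

library(gbm)
set.seed(1)
# Gradient boosted trees: many shallow trees, each fit to the residuals
fit <- gbm(Monthly_Sales ~ ., data = train.df,
           distribution = "gaussian",
           n.trees = 2000, interaction.depth = 4,
           shrinkage = 0.01, cv.folds = 5)
best.iter <- gbm.perf(fit, method = "cv")  # tree count chosen by cross-validation
pred <- predict(fit, newdata = test.df, n.trees = best.iter)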
Online Product Sales
On the public leaderboard: [standings omitted]

Online Product Sales
On the private leaderboard: [standings omitted]

Thank You! Questions?

Extra Slides

R Code for Dunnhumby Time Series
[code slide not recovered]

[diagram omitted: the 150 x 4 iris matrix factored as X (150 x 4) = U (150 x 4) D (4 x 4) V^T (4 x 4)]

> my.svd <- svd(iris[,1:4])
> objects(my.svd)
[1] "d" "u" "v"
> my.svd$d
[1] 95.959914 17.761034  3.460931  1.884826
> dim(my.svd$u)
[1] 150   4
> dim(my.svd$v)
[1] 4 4
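A small closing check (not on the original slides): recomposing U D V^T reproduces the original measurements up to floating-point error.

my.svd <- svd(iris[, 1:4])
X.hat <- my.svd$u %*% diag(my.svd$d) %*% t(my.svd$v)
max(abs(X.hat - as.matrix(iris[, 1:4])))  # effectively zero (~1e-13)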