# Homework 1

YOUR NAME HERE

October 22, 2015

## Instructions

To prepare for this homework, do the following:

1. On line 3 of this document, replace "YOUR NAME" with your name.
2. Rename this file to "hw1_YourNameHere.Rmd", where YourNameHere is replaced with your own name.
3. Each time you begin a new R session, run the setup code below.

## Setup

You must run the following code in your R session each time you start R.

```
## Warning: package 'ElemStatLearn' was built under R version 3.2.2
## Warning: package 'leaps' was built under R version 3.2.2
## Warning: package 'glmnet' was built under R version 3.2.2
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.2.2
## Loaded glmnet 2.0-2
```

## LEAPS

### Prostate Example

This is drawn from the help('prostate') page. Note that although the code below appears in the Rmd view, it does not appear in the knitted version because I have added echo=FALSE, eval=FALSE as options to the code chunk.

### Activity 1 (20 points)

Recall that best subset selection is only computationally feasible up to 30 variables or so. To use it, let's trim down the news dataset. I have picked a few variables that I think, without any prior information or guidance, may be important.
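The 30-variable limit comes from the combinatorics of exhaustive search: every subset of the p predictors is a candidate model, so there are 2^p models to fit. A quick back-of-the-envelope check (my own illustration, not part of the assignment):

```r
## Exhaustive best subset search fits one model per subset of predictors,
## so the number of candidate models grows as 2^p.
p <- c(10, 20, 30, 58)  # 58 = number of predictors in the full news data
data.frame(p = p, models = 2^p)
```

At around 20 predictors, roughly a million candidate models is still tolerable; at the full 58 predictors, 2^58 is hopeless, which is why we trim the dataset first.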
You must run the following R code:

```r
news.small <- news[, c("n_tokens_title", "n_tokens_content", "n_unique_tokens",
                       "n_non_stop_words", "n_non_stop_unique_tokens",
                       "num_hrefs", "num_self_hrefs", "num_imgs", "num_videos",
                       "average_token_length", "num_keywords",
                       "data_channel_is_lifestyle", "data_channel_is_entertainment",
                       "data_channel_is_bus", "data_channel_is_socmed",
                       "data_channel_is_tech", "is_weekend",
                       "rate_positive_words", "rate_negative_words",
                       "lshares", "train")]

## Set up the training and test data
small.train <- subset(news.small, train == TRUE)
small.train$train <- NULL
small.test <- subset(news.small, train == FALSE)
small.test$train <- NULL
```

#### Part A

In the space below, find the best model via leaps when judged by RSS on the training data. Repeat for AIC and BIC. Compare and contrast these three models and guess which will have the lowest mean squared error on the test dataset.

```r
small.leaps <- regsubsets(lshares ~ .,
                          nvmax = 20,
                          nbest = 1,
                          method = "exhaustive",
                          data = small.train)
## Your code here.
```

Your English words go here.

#### Part B

After completing Part A, test these three models on the test data. Was your prediction correct? Explain what may have happened.

```r
## Your code here
```

Your English words go here.

## Step-wise regression

### Prostate Example

Note that although the code below appears in the Rmd view, it does not appear in the knitted version because I have added echo=FALSE, eval=FALSE as options to the code chunk.

### Activity 2 (20 points)

Use forward and backward selection to find a model using all of the columns instead of merely the ones I selected earlier. Two questions:

1. How do the forward and backward models compare with the best model from Activity 1 (e.g. number of variables, common variables, etc.)?
2. Are the predictions (on the test data) better for the stepwise models or the Activity 1 model? Why?

Note that I have added the option cache=TRUE to this code chunk. This is because the stepwise search that you will do here is slow.
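As a minimal sketch of what the forward and backward searches might look like via step() — assuming the full training frame is called news.train with lshares as the response (names borrowed from the ridge and LASSO chunks later in this document); treat this as a starting point, not the assigned solution:

```r
## Forward and backward stepwise selection (AIC by default).
## Assumes news.train exists, as in the ridge/LASSO setup chunks below.
null.fit <- lm(lshares ~ 1, data = news.train)  # intercept-only start
full.fit <- lm(lshares ~ ., data = news.train)  # all columns

fwd <- step(null.fit, scope = formula(full.fit),
            direction = "forward", trace = 0)
bwd <- step(full.fit, direction = "backward", trace = 0)
```

Both searches are greedy, which is why they can handle all 58 predictors where exhaustive search cannot — and also why caching helps: each step() call still fits many models along the way.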
With caching on, you only have to wait for it once, not every time you knit your document.

```r
## R code goes here.
```

English words go here.

## Shrinkage methods

### Prostate Example: Ridge

Note that although the code below appears in the Rmd view, it does not appear in the knitted version because I have added echo=FALSE, eval=FALSE as options to the code chunk.

### Activity 3 (20 points)

Try out ridge regression on the news dataset. Compare its predictions with the best stepwise model on the test data. Did it work better? Why or why not?

```r
x <- as.matrix(news.train[, 1:58])
x.test <- as.matrix(news.test[, 1:58])
y <- news.train[, 59]
## R code here
```

English words here.

### Prostate Example: LASSO

Note that although the code below appears in the Rmd view, it does not appear in the knitted version because I have added echo=FALSE, eval=FALSE as options to the code chunk.

### Activity 4 (20 points)

Try out the LASSO. Compare its predictions with the best models thus far. Does it work better? Why or why not?

```r
x <- as.matrix(news.train[, 1:58])
x.test <- as.matrix(news.test[, 1:58])
y <- news.train[, 59]
## R code goes here
```

English words go here.

## Summary Activity (20 points)

Now that you have analyzed the news dataset in several ways, pretend that you are in the following scenario. You work for Mashable.com. Your boss has handed you these data and asked you to learn something about what factors affect how often articles are shared on social media. You've made a lot of graphs. You've generated a lot of numbers. Use the space below to answer the following two questions:

Thanks, YOUR NAME, for analyzing these data. We would like to increase the number of shares on social media. If we have a candidate list of 100 articles to put on our front page, how should we choose the 10 to promote there? Note: in answering the question, you can assume that you have the full set of variables, i.e. there are effectively 100 new rows of the data.

Your Answer here.

Thanks, YOUR NAME, that was a thoughtful and undoubtedly correct answer.
One more question: do you think that the method you proposed will still work in 5 years? If so, why? If not, what should we do to stay on top of the game?

Your Answer here.