
Homework 1
YOUR NAME HERE
October 22, 2015
Instructions
To prepare for this homework, do the following:
1. On line 3 of this document, replace "YOUR NAME" with your name.
2. Rename this file to "hw1_YourNameHere.Rmd", where YourNameHere is changed to your own name.
3. Each time you begin a new R session, run the setup code below.
Setup
You must run the following code in your R session each time you start R.
## Warning: package 'ElemStatLearn' was built under R version 3.2.2
## Warning: package 'leaps' was built under R version 3.2.2
## Warning: package 'glmnet' was built under R version 3.2.2
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.2.2
## Loaded glmnet 2.0-2
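The setup chunk itself is not echoed in this view; judging from the startup messages above, it presumably loads at least the packages below (a sketch, not the exact chunk; the news data is assumed to be loaded in the same chunk).
library(ElemStatLearn)   # prostate data used in the worked examples
library(leaps)           # best-subset and stepwise selection via regsubsets()
library(glmnet)          # ridge regression and the LASSO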
LEAPS
Prostate Example
This is drawn from the help('prostate') page. Note that although the code below appears
in the Rmd view, it does not appear in the Knitted version because I have added
echo=FALSE, eval=FALSE as options to the code chunk.
Activity 1 (20 points)
Recall that best subset selection is only computationally feasible up to about 30 variables. To use it, let's trim down the news dataset. I have picked a few variables that I think, without any prior information or guidance, may be important.
You must run the following R code:
news.small <- news[,
c("n_tokens_title", "n_tokens_content", "n_unique_tokens",
"n_non_stop_words", "n_non_stop_unique_tokens", "num_hrefs",
"num_self_hrefs", "num_imgs", "num_videos", "average_token_length",
"num_keywords", "data_channel_is_lifestyle",
"data_channel_is_entertainment",
"data_channel_is_bus", "data_channel_is_socmed",
"data_channel_is_tech",
"is_weekend", "rate_positive_words", "rate_negative_words",
"lshares", "train")]
## Setup the training data
small.train <- subset(news.small, train==TRUE)
small.train$train <- NULL
small.test <- subset(news.small, train==FALSE)
small.test$train <- NULL
Part A
In the space below, find the best model via leaps when judged by RSS on the training data.
Repeat for AIC and BIC. Compare and contrast these three models and guess which will
have the lowest mean squared error on the test dataset.
small.leaps <- regsubsets( lshares ~ .,
nvmax = 20,
nbest = 1,
method = "exhaustive",
data=small.train)
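To get you started, here is a minimal sketch of one way to pull out the models that are best by training RSS, AIC, and BIC from the fit above. Note that summary() of a regsubsets fit reports RSS, Cp, and BIC but not AIC, so the AIC values below are computed from RSS under a Gaussian likelihood, up to an additive constant; that formula is an addition for illustration, not part of the assignment.
leaps.summary <- summary(small.leaps)
n <- nrow(small.train)
p <- seq_along(leaps.summary$rss)              # number of predictors in each model
## Gaussian-likelihood AIC, up to an additive constant
aic <- n * log(leaps.summary$rss / n) + 2 * (p + 1)
best.by.rss <- which.min(leaps.summary$rss)    # training RSS always favors the largest model
best.by.aic <- which.min(aic)
best.by.bic <- which.min(leaps.summary$bic)
coef(small.leaps, id = best.by.bic)            # coefficients of the BIC-best model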
## Your code here.
Your English words go here.
Part B
After completing Part A, evaluate these three models on the test data. Was your prediction correct? Explain what may have happened.
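One possible approach (a sketch only; predict.subset is a helper defined here for illustration, not a function from leaps) is to build the test model matrix and multiply it by the coefficients of a chosen subset size:
## Predictions from a regsubsets fit for the model of size k.
predict.subset <- function(fit, newdata, k) {
  cf <- coef(fit, id = k)
  X  <- model.matrix(lshares ~ ., data = newdata)
  drop(X[, names(cf)] %*% cf)
}
## Test MSE for, e.g., the BIC-best size from Part A (the choice of k is illustrative).
k <- best.by.bic
mean((small.test$lshares - predict.subset(small.leaps, small.test, k))^2)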
## Your code here
Your English words go here.
Step-wise regression
Prostate Example
Note that although the code below appears in the Rmd view, it does not appear in the
Knitted version because I have added echo=FALSE, eval=FALSE as options to the code
chunk.
Activity 2 (20 points)
Use forward and backward selection to find a model using all of the columns instead of
merely the ones I selected earlier. Two questions:
1. How do the forward and backward models compare with the best model from Activity 1 (e.g., number of variables, common variables, etc.)?
2. Are the predictions (on the test data) better for the stepwise models or the Activity 1 model? Why?
Note that I have added the option cache=TRUE to this code chunk. This is because the
stepwise search that you will do here is slow. With caching on, you only have to wait for it
once, not every time you knit your document.
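For reference, a sketch of how forward and backward selection could be set up with step() on the full training data. This assumes news.train contains all predictors plus the log-shares response lshares, as in the later activities; the search scope shown is one reasonable choice, not a requirement.
full.fit <- lm(lshares ~ ., data = news.train)
null.fit <- lm(lshares ~ 1, data = news.train)
forward.fit  <- step(null.fit, scope = formula(full.fit),
                     direction = "forward", trace = 0)
backward.fit <- step(full.fit, direction = "backward", trace = 0)
## Compare the selected variables, e.g. names(coef(forward.fit)) vs. names(coef(backward.fit))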
#R code goes here.
English words go here.
Shrinkage methods
Prostate example: Ridge
Note that although the code below appears in the Rmd view, it does not appear in the
Knitted version because I have added echo=FALSE, eval=FALSE as options to the code
chunk.
Activity 3 (20 points)
Try out ridge regression on the news dataset. Compare its predictions on the test data with those of the best stepwise model. Did it work better? Why or why not?
x <- as.matrix(news.train[,1:58])
x.test <- as.matrix(news.test[,1:58])
y <- news.train[,59]
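A minimal sketch of a cross-validated ridge fit with glmnet (alpha = 0 selects the ridge penalty). Using lambda.min and treating column 59 of news.test as the response are assumptions here, mirroring the training setup above.
ridge.cv   <- cv.glmnet(x, y, alpha = 0)
ridge.pred <- predict(ridge.cv, newx = x.test, s = "lambda.min")
mean((news.test[, 59] - ridge.pred)^2)   # test MSE for ridge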
## R code here
English words here.
Prostate Example: LASSO
Note that although the code below appears in the Rmd view, it does not appear in the
Knitted version because I have added echo=FALSE, eval=FALSE as options to the code
chunk.
Activity 4 (20 points)
Try out the LASSO. Compare its predictions with the best models thus far. Does it work
better? Why or why not?
x <- as.matrix(news.train[,1:58])
x.test <- as.matrix(news.test[,1:58])
y <- news.train[,59]
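And a parallel sketch for the LASSO (alpha = 1 in glmnet); again, the lambda.min choice and the response column are assumptions.
lasso.cv   <- cv.glmnet(x, y, alpha = 1)
lasso.pred <- predict(lasso.cv, newx = x.test, s = "lambda.min")
mean((news.test[, 59] - lasso.pred)^2)   # test MSE for the LASSO
coef(lasso.cv, s = "lambda.min")         # note which coefficients are exactly zero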
## R code goes here
English words go here.
Summary Activity (20 pts)
Now that you have analyzed the news dataset in several ways, pretend that you are in the
following scenario. You work for Mashable.com. Your boss has handed you these data and
asked you to learn something about what factors affect how often articles are shared on
social media.
You've made a lot of graphs. You've generated a lot of numbers. Use the space below to
answer the following two questions:
Thanks, YOUR NAME, for analyzing these data. We would like to increase the number of shares on social media. If
we have a candidate list of 100 articles to put on our front page, how should we choose the 10 to promote there?
Note: In answering the question, you can assume that you have the full set of variables, i.e., there are effectively 100 new rows of data.
Your Answer here.
Thanks, YOUR NAME, that was a thoughtful and undoubtedly correct answer. One more question: do you think
that the method you proposed will still work in 5 years? If so, why? If not, what should we do to stay on top of the
game?
Your Answer here.