Lab 3: Logistic regression models

In this lab, we will apply logistic regression models to United States (US) presidential election data sets. The main purpose is to predict the outcome of the presidential election in each state based on election polls and historical election results. We will use the election poll data sets collected in 2008 and 2012, together with the true election outcomes of 2008, to predict the election outcomes of 2012. It might be more interesting to predict the outcomes of the 2016 presidential election; however, due to data availability limitations and the current stage of the election, we will not consider the 2016 election in today's lab.

The US presidential election is held every four years on the Tuesday after the first Monday in November. The 2016 presidential election is scheduled for Nov 8, 2016; the 2008 and 2012 elections were held on Nov 4, 2008 and Nov 6, 2012, respectively. The President of the US is not elected directly by popular vote. Instead, the President is elected by electors who are selected by popular vote on a state-by-state basis, and these electors cast direct votes for the President. In all states except Maine and Nebraska, electors are selected on a "winner-take-all" basis: all of a state's electoral votes go to the presidential candidate who wins that state's popular vote. For simplicity, we will assume all states use the winner-take-all principle in this lab. The number of electors in each state equals the number of members of Congress from that state. Currently, there are a total of 538 electors: 435 House representatives, 100 senators and 3 electors from the District of Columbia. A presidential candidate who receives an absolute majority of the electoral votes (at least 270) is elected President.

For simplicity, our data analysis considers only the two major political parties: Democratic (Dem) and Republican (Rep). The interest is in predicting which party (Dem or Rep) will win the most votes in each state. Because the chance that a third party receives an electoral vote is very small, this simplification is reasonable.

Predicting the outcome of a presidential election campaign is of great interest to many people. In the past, predictions were typically made by political analysts and pundits based on their personal experience, intuition and preferences. In recent decades, however, statistical methods have been widely used in predicting election results. Notably, in 2012, statistician Nate Silver correctly predicted the outcome in every state, after correctly calling 49 of the 50 states in 2008. In today's lab, we will compare his method (in a simplified version) to a method built on logistic regression models.

Data sets

The following data sets are available for our analysis:

1) Polling data from the 2008 US presidential election (2008-polls.csv);
2) Election results from the 2008 US presidential election (2008-results.csv);
3) Polling data from the 2012 US presidential election (2012-polls.csv);
4) Election results from the 2012 US presidential election (2012-results.csv).

Data sets 1) and 2) will be used for training, that is, to build the logistic regression models. Data set 3) will be used for prediction. Data set 4) is provided for validation, which lets us check whether our predictions are correct.

Both polling data sets 1) and 3) contain five columns.
The first column is the state abbreviation (SA). The second and third columns are, respectively, the percentages of votes for the Democratic and Republican candidates. The fourth column is the date on which the poll was conducted. The last column is the name of the pollster institution.

Election polls

Our prediction will be based on election polls. An election poll is a survey that samples a small portion of voters about their voting plans. If the survey is conducted appropriately, the sampled voters should be representative of the voting population at large. However, it is very challenging to obtain a good representative group, because a good sampling strategy needs to consider many factors (e.g., sampling time, locations, methods). Therefore, a single poll's prediction can be biased, and prediction accuracy can be improved by combining multiple polls.

There are many possible factors affecting the prediction accuracy of election polls. Based on the available data sets, we consider the following three.

1. Sampling time. Understandably, if the sampling time is far ahead of the election date, the accuracy is likely to be worse than for polls conducted closer to the election date. Many events can change voters' opinions about the presidential candidates, so the longer the time between poll and election, the more likely voters are to change their voting plans.

2. Pollsters. Systematic biases can occur if a flawed sampling method is used. For example, if a pollster only collects samples through the Internet, the sample is biased because it includes only those who have Internet access. Each pollster uses different methods for sampling voters, and some sampling schemes are better than others. Therefore, it is very likely that some pollsters' predictions are more reliable than others', and we should not give equal weight to every poll.

3. State edges. The state edge is the difference between the Democratic and Republican popular vote percentages (based on the polls) in that state. For instance, if the Democratic candidate receives 55% of the votes and the Republican candidate receives 45%, the Democratic edge is 10 percentage points. When the state edge is small, the prediction accuracy of a poll is more likely to be affected by sampling errors; when the state edge is large, the prediction is less likely to be affected by them.

Silver's approach

Nate Silver's algorithm is described in detail at the FiveThirtyEight blog (http://fivethirtyeight.blogs.nytimes.com/methodology/?_r=0). The key idea of his algorithm is to smooth (average) the results of different polls using a weighted average. Silver's algorithm gives weight to each pollster according to its prediction accuracy in previous elections; more biased pollsters receive less weight. The general structure of the (simplified) algorithm is as follows; a small numerical sketch is given after the list.

1. Calculate the average error of each pollster's predictions for previous elections. This is known as the pollster's rank. A smaller rank indicates a more accurate pollster.

2. Transform each rank into a weight. In this lab, we simply set the weight to one over the square of the rank, i.e., weight = 1/rank^2. (Silver's actual algorithm considers a number of additional factors when computing a weight, but we lack that information in the available data sets.)

3. For each state, compute a weighted average of the predictions made by the pollsters. This weighted average predicts the winner in that state.
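To make steps 1-3 concrete, here is a minimal numerical sketch; the pollster names, error values and percentages below are made up for illustration.

errors<-c(PollsterA=2.1,PollsterB=3.4,PollsterC=1.5) ## Step 1: hypothetical average errors
ranks<-rank(errors) ## Step 2: smaller error, smaller rank (PollsterC gets rank 1)
weights<-1/ranks^2 ## weight = 1/rank^2, i.e., 1/4, 1/9, 1
dempolls<-c(48,51,50) ## Step 3: hypothetical Dem percentages in one state
weighted.mean(dempolls,weights) ## weighted average, approximately 49.7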
In this lab, we will compare our method based on logistic regression models with Silver's approach in predicting the presidential election winner in each state. To this end, please answer the following questions.

Q1. Read the data sets "2008-polls.csv", "2012-polls.csv" and "2008-results.csv" into R. To simplify the analysis, we focus on subsets of the available data sets, selected by pollster, because not all pollsters conducted polls in every state. Use R to find the pollsters that conducted at least five polls in both the 2008 and 2012 polling data sets 1) and 3). Then create subsets of the 2008 and 2012 polling data sets containing only the polls collected by those selected pollsters.

Answer: To read the data sets into R, we use the following R code.

setwd("…") ## Change to the directory where you saved the data sets
polls2008<-read.csv(file="2008-polls.csv",header=TRUE)
polls2012<-read.csv(file="2012-polls.csv",header=TRUE)
results2008<-read.csv(file="2008-results.csv",header=TRUE)

Because the data sets are stored in csv files, we use read.csv to read them. To select the pollsters that conducted at least five polls, we first create frequency tables of the pollsters in the 2008 and 2012 polling data sets, and then keep the pollsters that appear at least five times in both. The following R code selects the desired pollsters.

pollsters20085<-table(polls2008$Pollster)[table(polls2008$Pollster)>=5]
pollsters20125<-table(polls2012$Pollster)[table(polls2012$Pollster)>=5]
subset1<-names(pollsters20085)[names(pollsters20085)%in%names(pollsters20125)]
pollers<-names(pollsters20125)[names(pollsters20125)%in%subset1]

Finally, we create the subsets of the 2008 and 2012 data sets that were collected by the selected pollsters.

subsamplesID2008<-polls2008[,5]%in%pollers
polls2008sub<-polls2008[subsamplesID2008,]
subsamplesID2012<-polls2012[,5]%in%pollers
polls2012sub<-polls2012[subsamplesID2012,]
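As a side note, the same pollster selection can be written more compactly with the intersect function; this sketch assumes the objects read in above.

tab2008<-table(polls2008$Pollster)
tab2012<-table(polls2012$Pollster)
pollers<-intersect(names(tab2008[tab2008>=5]),names(tab2012[tab2012>=5])) ## same set as above
polls2008sub<-polls2008[polls2008$Pollster%in%pollers,]
polls2012sub<-polls2012[polls2012$Pollster%in%pollers,]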
Q2. For the purpose of performing logistic regression, we need to define three new variables using the data sets created in Q1. First, define a binary response variable (Resp) indicating whether the prediction given by each poll is correct: Resp is 1 if the prediction is correct and 0 otherwise. To check whether the prediction given by each poll is correct, first find the predicted winner for each state, and then compare it with the actual winner in the data set "2008-results.csv". Second, define the state edges following the definition given above. Finally, compute the number of days between the sampling time (polling date) and the 2008 presidential election date (the lag time); the 2008 election date is Nov 4, 2008. Combine the defined variables (Resp, state edge and lag time), the state names and the pollsters into a new data set.

Answer: We first define the response variable based on the 2008 polling data and the true election results. The following R code can be used for this purpose.

winers2008<-(results2008[,2]-results2008[,3]>0)+0
StateID2008<-results2008[,1]
Allresponses<-NULL
for (sid in 1:51) {
polls2008substate<-polls2008sub[polls2008sub$State==StateID2008[sid],]
pollwiners2008state<-(polls2008substate[,2]-polls2008substate[,3]>0)+0
pollwinersIND<-(pollwiners2008state==winers2008[sid])+0
Allresponses<-c(Allresponses,pollwinersIND)
}

Then we define the state edges and the lag time using the following R code; note how the new variable lag time is computed with as.Date. Finally, we combine the newly defined variables into a new data set.

margins<-abs(polls2008sub[,2]-polls2008sub[,3])
lagtime<-rep(0,dim(polls2008sub)[1])
electiondate2008<-c("Nov 04 2008")
for (i in 1:dim(polls2008sub)[1]) {
lagtime[i]<-as.Date(electiondate2008, format="%b %d %Y")-as.Date(as.character(polls2008sub[i,4]), format="%b %d %Y")
}
dataset2008<-cbind(Allresponses,as.character(polls2008sub[,1]),margins,lagtime,as.character(polls2008sub[,5]))

Q3. In the data set created in Q2, you might find that the responses (Resp) of some states are all equal to 1. For these states, the prediction is relatively easy. Therefore, we will focus on the states that are relatively difficult to predict. Please select the states whose responses (Resp) contain at least one 0, and then find the corresponding subsets of the polling data for those selected states.

Answer: We find the states that have at least one 0 and put them into a list; then we select the corresponding subset. We use the following R code.

stateslist<-unique(dataset2008[which(dataset2008[,1]=="0"),2])
subdataset2008<-dataset2008[dataset2008[,2]%in%stateslist,]

Q4. Now fit a logistic regression model using the data set created in Q3. In the model, use Resp as the binary response variable, SA and Pollster as categorical predictors, and the two predictors defined in Q2, lag time and state edge, as continuous predictors. Based on the fitted model, which predictors are significantly associated with Resp? Please also conduct a hypothesis test to examine whether the categorical variable SA is significant.

Answer: To fit the logistic regression model in R, we first define the following variables based on the data set created in Q3. Since we treat SA and Pollster as categorical predictors, we define them as factors.

resp<-as.integer(subdataset2008[,1])
statesFAC<-as.factor(subdataset2008[,2])
margins<-as.double(subdataset2008[,3])
lagtime<-as.double(subdataset2008[,4])
pollersFAC<-as.factor(subdataset2008[,5])

Then we fit a logistic regression model using SA, state edges, lag time and the pollsters as predictors.

logitreg<-glm(resp~statesFAC+margins+lagtime+pollersFAC,family="binomial")
summary(logitreg)

The output of the logistic regression model is as follows.

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                  0.477331   0.602829   0.792 0.428466
statesFACFL                 -1.647375   0.547285  -3.010 0.002612 **
statesFACGA                  1.354619   1.157192   1.171 0.241756
statesFACIN                 -3.359969   0.926903  -3.625 0.000289 ***
statesFACMA                  2.064918   1.233361   1.674 0.094087 .
statesFACMI                 -0.109506   0.733656  -0.149 0.881348
statesFACMN                  1.576421   0.909859   1.733 0.083167 .
statesFACMO                 -0.427149   0.614079  -0.696 0.486683
statesFACMT                  1.572071   1.165776   1.349 0.177491
statesFACNC                 -2.289227   0.641511  -3.568 0.000359 ***
statesFACND                  0.582515   1.411664   0.413 0.679867
statesFACNH                  0.608812   0.770412   0.790 0.429386
statesFACNJ                  0.562342   0.953698   0.590 0.555429
statesFACNM                  0.115791   0.722887   0.160 0.872741
statesFACNV                 -0.782439   0.620767  -1.260 0.207511
statesFACNY                  1.106166   1.220608   0.906 0.364808
statesFACOH                 -1.456813   0.554890  -2.625 0.008655 **
statesFACOR                  2.466634   1.227231   2.010 0.044440 *
statesFACPA                  0.999567   0.706504   1.415 0.157125
statesFACVA                 -0.764514   0.578862  -1.321 0.186595
statesFACWA                  2.049390   1.222229   1.677 0.093589 .
statesFACWI                  1.724056   0.952639   1.810 0.070332 .
statesFACWV                  0.176470   1.192351   0.148 0.882341
margins                      0.243394   0.038387   6.341 2.29e-10 ***
lagtime                     -0.010550   0.001722  -6.128 8.89e-10 ***
pollersFACEPICMRA            1.884727   1.341388   1.405 0.160004
pollersFACInsiderAdvantage   0.831820   0.586503   1.418 0.156112
pollersFACMaristColl         1.899700   1.201596   1.581 0.113883
pollersFACMasonDixon         0.368782   0.590033   0.625 0.531958
pollersFACMuhlenbergColl    -0.107470   1.516623  -0.071 0.943508
pollersFACQuinnipiacU        1.742448   0.629726   2.767 0.005658 **
pollersFACRasmussen          0.273553   0.451894   0.605 0.544948
pollersFACSienaColl         15.026258 542.747543   0.028 0.977913
pollersFACSuffolkU           1.166058   0.920064   1.267 0.205024
pollersFACSurveyUSA          0.831435   0.518039   1.605 0.108501
pollersFACUofCincinnati      0.399582   1.113652   0.359 0.719742
pollersFACUofNewHampshire   -1.361725   1.333940  -1.021 0.307335
pollersFACZogby              0.501113   0.745531   0.672 0.501484
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that in the above R output, SA and Pollster are treated as categorical variables. The baseline level for SA is "CO" (Colorado), and the baseline level for Pollster is "ARG". Based on the output, the state edge (margins) and the lag time are statistically significant in affecting the success probability of the response (the prediction accuracy). Among the levels of the categorical variable SA, the states "FL", "IN", "NC", "OH" and "OR" are statistically significant at the nominal level 0.05; that is, these five states differ from the baseline state "CO" in prediction accuracy, holding the other predictors fixed. Among the levels of the categorical variable Pollster, only "QuinnipiacU" is statistically different from the baseline pollster "ARG"; its positive coefficient suggests that polls conducted by "QuinnipiacU" are more accurate than those of "ARG" in predicting the election outcome.

To check whether the categorical variable SA is significant as a whole, we use a likelihood ratio test comparing the model with the predictor SA to the model without it.

logitreg1<-glm(resp~margins+lagtime+pollersFAC,family="binomial")
anova(logitreg1,logitreg, test="Chisq")

The following output provides the analysis of deviance table and the corresponding p-value for assessing the significance of the variable SA.

Analysis of Deviance Table

Model 1: resp ~ margins + lagtime + pollersFAC
Model 2: resp ~ statesFAC + margins + lagtime + pollersFAC
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1       647     620.09
2       625     492.68 22   127.41 < 2.2e-16 ***

Because the likelihood ratio test has a p-value much smaller than 0.05, we conclude that the variable SA is significant.
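As a side note (not required by the question), the two continuous coefficients are easier to interpret on the odds scale; this sketch uses the model fitted above.

exp(coef(logitreg)["margins"]) ## exp(0.2434), about 1.28: each additional point of state edge multiplies the odds of a correct call by about 1.28
exp(coef(logitreg)["lagtime"]) ## exp(-0.0106), about 0.99: each additional day of lag lowers those odds by about 1%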
Q5. Refit the logistic regression model in Q4 without the categorical variable SA. Compare this model with the model fitted in Q4: which one is better?

Answer: Deleting the categorical variable SA, we refit the model with the following R code.

logitreg1<-glm(resp~margins+lagtime+pollersFAC,family="binomial")

To compare it with the model fitted in Q4, we use AIC and BIC to select an appropriate model. AIC and BIC are more appropriate for model selection because these information criteria penalize the complexity of the models; the likelihood ratio test conducted in Q4 only shows that the model with SA fits the data better than the model without SA, which is a goodness-of-fit comparison. The following table presents the AIC and BIC values for both models (these values can be computed in R with the AIC and BIC functions).

                   AIC        BIC
Model without SA   652.0945   724.0429
Model with SA      568.6820   739.5594

Based on the above table, the model with SA has the smaller AIC value, so if AIC is used we should choose the model with SA; however, if BIC is used, the model without SA is better. The conclusions given by AIC and BIC differ because AIC puts a smaller penalty on the number of free parameters than BIC does. This also explains why the model selected by AIC is typically bigger (contains more predictors) than the model selected by BIC.

Q6. For prediction purposes, we need to define the new variables, state edge and lag time, for the 2012 polling data set. The definitions of these variables are the same as in Q2; for computing the lag time, note that the 2012 presidential election date is Nov 6, 2012. Then create a new data set containing these two variables for the polls conducted by the pollsters selected in Q1 in the states selected in Q3. Based on the logistic regression models fitted in Q4 and Q5, predict the mean of the response variable (Resp) for the data set just created; the mean of Resp is the probability that Resp=1 (the success probability). Please predict the success probability of each poll for the following states: FL, MI, MO and CO.

Answer: We define the new variables for prediction using the 2012 polling data set, in the same way as in Q2. The following R code is used.

pollwiners2012<-(polls2012sub[,2]-polls2012sub[,3]>0)+0
margins2012<-abs(polls2012sub[,2]-polls2012sub[,3])
lagtime2012<-rep(0,dim(polls2012sub)[1])
electiondate2012<-c("Nov 06 2012")
for (i in 1:dim(polls2012sub)[1]) {
lagtime2012[i]<-as.Date(electiondate2012, format="%b %d %Y")-as.Date(as.character(polls2012sub[i,4]), format="%b %d %Y")
}
dataset2012<-cbind(pollwiners2012,as.character(polls2012sub[,1]),margins2012,lagtime2012,as.character(polls2012sub[,5]))

For our analysis, we focus on the states in the list created in Q3.

subdataset2012<-dataset2012[dataset2012[,2]%in%stateslist,]

Using these new variables, it is easy to perform predictions for each poll with the function predict in R.
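Since nearly identical steps are repeated for each state below, note that predict is vectorized over the rows of its newdata argument, so each state needs only a single call. The following sketch wraps this in a helper; state_pred is our own name (not part of the lab code), and the sketch assumes the objects created above.

state_pred<-function(st, model, with_state=TRUE) {
rows<-which(subdataset2012[,2]==st) ## polls of state st
newdat<-data.frame(margins=as.double(subdataset2012[rows,3]),
                   lagtime=as.double(subdataset2012[rows,4]),
                   pollersFAC=subdataset2012[rows,5])
if (with_state) newdat$statesFAC<-st ## the Q4 model needs the state factor; the Q5 model does not
cbind(as.double(subdataset2012[rows,1]), predict(model, newdat, type="response"))
}
## e.g., state_pred("MI",logitreg) and state_pred("MI",logitreg1,with_state=FALSE)
## should reproduce the MIPredictresults and MIPredictresults1 objects computed below.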
The predictions of the success probabilities of the polls in MI using the model in Q4 can be computed with the following R code.

margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="MI")
locations<-which(subdataset2012[,2]=="MI")
MIPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations) {
counts<-counts+1
MIdatapoints<-data.frame(statesFAC="MI", margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
MIPredictresults[counts,2]<-predict(logitreg, MIdatapoints, type="response")
}

The predictions for the polls in MI using the model in Q5 are computed similarly.

NOpolls<-sum(subdataset2012[,2]=="MI")
locations<-which(subdataset2012[,2]=="MI")
MIPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations) {
counts<-counts+1
MIdatapoints<-data.frame(margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
MIPredictresults1[counts,2]<-predict(logitreg1, MIdatapoints, type="response")
}

The results are given below. The first column is the predicted winner based on each poll, the second column gives the success probabilities predicted using the model in Q4, and the third column gives the probabilities predicted using the model in Q5.

 [1,] 0 0.7050161 0.7150713
 [2,] 1 0.9836286 0.9829498
 [3,] 1 0.7184638 0.8035997
 [4,] 1 0.8614575 0.8714864
 [5,] 1 0.7653980 0.8954679
 [6,] 1 0.7277984 0.7007603
 [7,] 1 0.9357792 0.9352594
 [8,] 1 0.9741315 0.9605871
 [9,] 1 0.9343612 0.9006867
[10,] 1 0.9041542 0.8784930
[11,] 1 0.7424675 0.7674955
[12,] 1 0.9802705 0.9943562
[13,] 1 0.8880373 0.8321297
[14,] 1 0.9707425 0.9578605
[15,] 1 0.8588295 0.7600386
[16,] 1 0.9554800 0.9480250

Similar code can be used for predicting the success probabilities of the polls in the states "FL", "MO" and "CO". The R code for the "FL" state is given below; the first part applies the model in Q4.

margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="FL")
locations<-which(subdataset2012[,2]=="FL")
FLPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations) {
counts<-counts+1
FLdatapoints<-data.frame(statesFAC="FL", margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
FLPredictresults[counts,2]<-predict(logitreg, FLdatapoints, type="response")
}

The second part applies the model in Q5.

NOpolls<-sum(subdataset2012[,2]=="FL")
locations<-which(subdataset2012[,2]=="FL")
FLPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations) {
counts<-counts+1
FLdatapoints<-data.frame(margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
FLPredictresults1[counts,2]<-predict(logitreg1, FLdatapoints, type="response")
}

The prediction results are given below in the same format: the first column is the predicted winner based on each poll, the second column the success probabilities from the model in Q4, and the third column those from the model in Q5.
 [1,] 0 0.56214658 0.8172779
 [2,] 0 0.14006329 0.5900694
 [3,] 0 0.23539652 0.4892338
 [4,] 0 0.21664557 0.4612514
 [5,] 0 0.13079384 0.4661476
 [6,] 0 0.05549671 0.3457921
 [7,] 0 0.64968573 0.8476961
 [8,] 1 0.53303376 0.7561092
 [9,] 0 0.07828385 0.2883600
[10,] 0 0.06236959 0.2514030
[11,] 0 0.12796709 0.3417972
[12,] 0 0.64710916 0.8262847
[13,] 1 0.51461589 0.7485297
[14,] 1 0.06433675 0.3162441
[15,] 1 0.58318467 0.7768037
[16,] 1 0.14151955 0.3723813
[17,] 1 0.15701236 0.4587324
[18,] 0 0.32868218 0.6065016
[19,] 0 0.52997354 0.7448883
[20,] 1 0.04629752 0.2763060
[21,] 1 0.46001315 0.4928207
[22,] 1 0.64411350 0.8359995
[23,] 1 0.42661942 0.5886155
[24,] 0 0.39076333 0.5313525
[25,] 0 0.44675510 0.4479178
[26,] 0 0.31911545 0.5337634
[27,] 0 0.45085822 0.6782808
[28,] 0 0.70216913 0.8345176
[29,] 1 0.42993929 0.7249804
[30,] 1 0.47746333 0.8461812
[31,] 1 0.50577197 0.7283998
[32,] 1 0.27092546 0.5018527
[33,] 1 0.66321631 0.8422744
[34,] 1 0.67279456 0.7316409
[35,] 1 0.37520930 0.3721240
[36,] 1 0.25649807 0.4712261
[37,] 0 0.36904939 0.5639717
[38,] 1 0.42084846 0.8986539
[39,] 1 0.47564231 0.8111579
[40,] 1 0.79508390 0.9341599
[41,] 1 0.62016858 0.7645409
[42,] 1 0.76200376 0.8903070
[43,] 1 0.39465029 0.7093355
[44,] 1 0.72876273 0.8704343
[45,] 1 0.90964579 0.9562761

The R code for the "MO" state is given below. We first apply the model in Q4.

margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="MO")
locations<-which(subdataset2012[,2]=="MO")
MOPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations) {
counts<-counts+1
MOdatapoints<-data.frame(statesFAC="MO", margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
MOPredictresults[counts,2]<-predict(logitreg, MOdatapoints, type="response")
}

Then we apply the model in Q5 for predicting the success probabilities.

NOpolls<-sum(subdataset2012[,2]=="MO")
locations<-which(subdataset2012[,2]=="MO")
MOPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations) {
counts<-counts+1
MOdatapoints<-data.frame(margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
MOPredictresults1[counts,2]<-predict(logitreg1, MOdatapoints, type="response")
}

The results are summarized below in the same format: the first column is the predicted winner based on each poll, the second column the success probabilities from the model in Q4, and the third column those from the model in Q5.

 [1,] 0 0.5035039 0.7200187
 [2,] 0 0.9694242 0.9710735
 [3,] 0 0.6044276 0.7044485
 [4,] 0 0.8194090 0.8628442
 [5,] 0 0.7910887 0.8080985
 [6,] 0 0.9278236 0.8966974
 [7,] 0 0.9421374 0.9413403
 [8,] 0 0.5542213 0.4922224
 [9,] 0 0.6769271 0.7092202
[10,] 0 0.2520589 0.3617698
[11,] 0 0.6137583 0.5711565
[12,] 0 0.6647826 0.6007542
[13,] 1 0.4416067 0.4014089

The R code and the results for the state "CO" are given below. The first part uses the more complex model from Q4.
margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="CO")
COPredictresults<-cbind(as.double(subdataset2012[1:NOpolls,1]),rep(0,NOpolls))
for (i in 1:NOpolls) {
COdatapoints<-data.frame(statesFAC="CO", margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
COPredictresults[i,2]<-predict(logitreg, COdatapoints, type="response")
}

(Note that this code indexes rows 1:NOpolls rather than which(subdataset2012[,2]=="CO"); it works here because CO is first alphabetically and the polls are ordered by state, but using which() as for the other states would be safer.)

The following part performs the prediction for the state "CO" using the simpler logistic regression model in Q5.

NOpolls<-sum(subdataset2012[,2]=="CO")
COPredictresults1<-cbind(as.double(subdataset2012[1:NOpolls,1]),rep(0,NOpolls))
for (i in 1:NOpolls) {
COdatapoints<-data.frame(margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
COPredictresults1[i,2]<-predict(logitreg1, COdatapoints, type="response")
}

The predicted probabilities are summarized below in the same format: the first column is the predicted winner based on each poll, the second column the success probabilities from the model in Q4, and the third column those from the model in Q5.

 [1,] 0 0.2966687 0.2437847
 [2,] 0 0.6704433 0.5091158
 [3,] 0 0.9217368 0.8403067
 [4,] 1 0.7045756 0.7054296
 [5,] 0 0.7585907 0.6682250
 [6,] 0 0.8257308 0.6908284
 [7,] 1 0.8497009 0.6723346
 [8,] 1 0.7255040 0.5371906
 [9,] 0 0.4452990 0.3148480
[10,] 0 0.8973189 0.7094159
[11,] 0 0.7802836 0.5773134
[12,] 0 0.6515317 0.4903710
[13,] 0 0.8016550 0.6377274
[14,] 1 0.8738779 0.6823733
[15,] 0 0.9037745 0.8141867
[16,] 1 0.5948107 0.4947829
[17,] 1 0.6632452 0.4669786
[18,] 1 0.9559369 0.9366182
[19,] 1 0.9081582 0.9219822

Q7. In this question, we will predict the winner of each state (FL, MI, MO and CO) using the predictions given in Q6. To be concrete, define the winner indicator WIND as 1 if the Democratic candidate is the winner and 0 otherwise. From Q6 we know, for each poll, the probability that the poll made a correct prediction of the winner (i.e., Resp=1); note that Resp=1 exactly when the WIND based on the polling data agrees with the WIND based on the actual election data. We therefore use the average probability that WIND=1 to predict the probability that Dem wins the election, and the average probability that WIND=0 to predict the probability that Rep wins. The average is taken across the predicted probabilities of all polls conducted in that state. Please do the prediction using both models in Q4 and Q5. Compare your predictions with the actual election results in the data file "2012-results.csv". What are your conclusions about the accuracy of your predictions?
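In symbols (the notation is ours, not from the handout): let $p_i$ be the predicted success probability of poll $i$ in a given state and $W_i$ the winner indicated by that poll. The averages described above are

$$P(\text{Dem wins}) \approx \frac{1}{n}\sum_{i=1}^{n}\left[W_i\,p_i + (1-W_i)(1-p_i)\right], \qquad P(\text{Rep wins}) \approx \frac{1}{n}\sum_{i=1}^{n}\left[(1-W_i)\,p_i + W_i(1-p_i)\right],$$

and the two averages sum to 1 for each state, as the table below confirms.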
Answer: The predicted probabilities for MI using the model in Q4 are computed with the following R code.

MIprobDemwin<-MIPredictresults[,1]*MIPredictresults[,2]+(1-MIPredictresults[,1])*(1-MIPredictresults[,2])
MImeanProbDemwin<-mean(MIprobDemwin)
MIprobGopwin<-(1-MIPredictresults[,1])*MIPredictresults[,2]+MIPredictresults[,1]*(1-MIPredictresults[,2])
MImeanProbGopwin<-mean(MIprobGopwin)

The predicted probabilities for MI using the model in Q5 are computed with the following R code.

MIprobDemwin1<-MIPredictresults1[,1]*MIPredictresults1[,2]+(1-MIPredictresults1[,1])*(1-MIPredictresults1[,2])
MImeanProbDemwin1<-mean(MIprobDemwin1)
MIprobGopwin1<-(1-MIPredictresults1[,1])*MIPredictresults1[,2]+MIPredictresults1[,1]*(1-MIPredictresults1[,2])
MImeanProbGopwin1<-mean(MIprobGopwin1)

Similarly, the predicted probabilities for the "FL" state using the model in Q4 are given by

FLprobDemwin<-FLPredictresults[,1]*FLPredictresults[,2]+(1-FLPredictresults[,1])*(1-FLPredictresults[,2])
FLmeanProbDemwin<-mean(FLprobDemwin)
FLprobGopwin<-(1-FLPredictresults[,1])*FLPredictresults[,2]+FLPredictresults[,1]*(1-FLPredictresults[,2])
FLmeanProbGopwin<-mean(FLprobGopwin)

For the model in Q5, we use the following R code.

FLprobDemwin1<-FLPredictresults1[,1]*FLPredictresults1[,2]+(1-FLPredictresults1[,1])*(1-FLPredictresults1[,2])
FLmeanProbDemwin1<-mean(FLprobDemwin1)
FLprobGopwin1<-(1-FLPredictresults1[,1])*FLPredictresults1[,2]+FLPredictresults1[,1]*(1-FLPredictresults1[,2])
FLmeanProbGopwin1<-mean(FLprobGopwin1)

Since the R code for the states MO and CO is similar to the above, we omit the details here; the code is included in the file "R-code-for-lab-3.txt" on the class web page. The prediction results are summarized in the table below.

      Model Q4                Model Q5                Actual Results
      Dem    Rep    Winner    Dem    Rep    Winner
MI    0.843  0.156  Dem       0.842  0.157  Dem       Dem
FL    0.553  0.447  Dem       0.577  0.422  Dem       Dem
MO    0.317  0.683  Rep       0.289  0.711  Rep       Rep
CO    0.491  0.509  Rep       0.522  0.478  Dem       Dem

Based on the above table, the predictions given by the model in Q5 are all correct, while the predictions given by the model in Q4 are not: the Q4 prediction for CO is wrong (it narrowly favors Rep), whereas the corresponding Q5 prediction is correct.

Q8. Please construct 95% prediction intervals for the average probabilities predicted in Q7.

Answer: The method for constructing the 95% prediction intervals was introduced in one of the notes sent through email. The R code for constructing the prediction intervals based on the model in Q4 is given below for the four states "MI", "FL", "MO" and "CO".
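The code below implements a delta-method approximation; since the emailed note is not reproduced in this handout, we restate the idea in our own notation. With $\hat\beta$ the fitted coefficients, $\widehat V(\hat\beta)$ their estimated covariance matrix (vcov in R), $x_i$ the model-matrix row of poll $i$ and $W_i$ its predicted winner, the gradient of the averaged Dem probability, its approximate variance and the resulting interval are

$$\bar g = \frac{1}{n}\sum_{i=1}^{n}(-1)^{1+W_i}\,\frac{e^{x_i^{\top}\hat\beta}}{\left(1+e^{x_i^{\top}\hat\beta}\right)^{2}}\,x_i, \qquad \widehat{\mathrm{Var}} = \bar g^{\top}\,\widehat V(\hat\beta)\,\bar g, \qquad \bar P_{\text{Dem}} \pm z_{0.975}\sqrt{\widehat{\mathrm{Var}}},$$

with the sign factor $(-1)^{W_i}$ used instead for the Rep probability. The function Deritive below computes the logistic derivative $e^{x^{\top}\beta}/(1+e^{x^{\top}\beta})^{2}$.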
Deritive<-function(x,beta) {
## derivative of the logistic mean function: exp(x'beta)/(1+exp(x'beta))^2
deri0<-exp(x%*%beta)
deri1<-deri0/((1+deri0)^2)
return(deri1)
}

## Prediction intervals for Michigan
locations<-which(subdataset2012[,2]=="MI")
sub.MI2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="MI")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.MI2012)[1]) {
pollerloc2008<-which(subdataset2008[,5]==sub.MI2012[i,5])
PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.MI2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg)) ## logistic derivative p_i(1-p_i) for each poll
Ghat2<-ModMatQ4*Ghat1 ## gradient of each p_i with respect to beta
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MI2012[,1]))) ## sign flip for polls calling Rep
Ghat<-colMeans(Ghat3) ## gradient of the averaged Dem probability
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat ## delta-method variance
MIPredIntQ4Dem<-c(MImeanProbDemwin-qnorm(0.975)*sqrt(Varphat),MImeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MI2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
MIPredIntQ4Rep<-c(MImeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),MImeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Florida
locations<-which(subdataset2012[,2]=="FL")
sub.FL2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="FL")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.FL2012)[1]) {
pollerloc2008<-which(subdataset2008[,5]==sub.FL2012[i,5])
PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.FL2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.FL2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
FLPredIntQ4Dem<-c(FLmeanProbDemwin-qnorm(0.975)*sqrt(Varphat),FLmeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.FL2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
FLPredIntQ4Rep<-c(FLmeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),FLmeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Missouri
locations<-which(subdataset2012[,2]=="MO")
sub.MO2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="MO")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.MO2012)[1]) {
pollerloc2008<-which(subdataset2008[,5]==sub.MO2012[i,5])
PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.MO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
MOPredIntQ4Dem<-c(MOmeanProbDemwin-qnorm(0.975)*sqrt(Varphat),MOmeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
MOPredIntQ4Rep<-c(MOmeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),MOmeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Colorado
locations<-which(subdataset2012[,2]=="CO")
sub.CO2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="CO")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.CO2012)[1]) {
pollerloc2008<-which(subdataset2008[,5]==sub.CO2012[i,5])
PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
ModMatQ4<rbind(ModMatQ4,c(SApart,as.numeric(sub.CO2012[i,3:4]),PollersIND)) } Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg)) Ghat2<-ModMatQ4*Ghat1 Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.CO2012[,1]))) Ghat<-colMeans(Ghat3) Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat COPredIntQ4Dem<-c(COmeanProbDemwinqnorm(0.975)*sqrt(Varphat),COmeanProbDemwin+qnorm(0.975)*sqrt(Varphat)) Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.CO2012[,1]))) Ghatrep<-colMeans(Ghat3rep) Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep COPredIntQ4Rep<-c(COmeanProbGopwinqnorm(0.975)*sqrt(Varphatrep),COmeanProbGopwin+qnorm(0.975)*sqrt(Varph atrep)) The following table summarizes the prediction intervals for all the states (MI, FL, MO and CO) based on model Q4. MI FL MO CO Dem 0.843 0.553 0.317 0.491 95% Prediction Interval (0.754, 0.933) (0.481, 0.624) (0.200, 0.434) (0.454, 0.527) Rep 0.156 0.447 0.683 0.509 95% Prediction Interval (0.067, 0.246) (0.375, 0.519) (0.566, 0.800) (0.473, 0.546) Based on the above table, we can see that the prediction intervals for MI and MO do not include 0.5. This suggests that the predictions for MI and MO made in Q7 are more reliable. But the prediction intervals for CO and FL include 0.5, which suggests that the predictions made using model Q4 is not very reliable for these two states. The R code for constructing prediction intervals for predictions using Q5 is given below: ## Prediction intervals for Michigan locations<-which(subdataset2012[,2]=="MI") sub.MI2012<-subdataset2012[locations,] ModMatQ5<-NULL for (i in 1:dim(sub.MI2012)[1]) { pollerloc2008<-which(subdataset2008[,5]==sub.MI2012[i,5]) PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)] 20 ModMatQ5<rbind(ModMatQ5,c(1,as.numeric(sub.MI2012[i,3:4]),PollersIND)) } Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1)) Ghat2<-ModMatQ5*Ghat1 Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MI2012[,1]))) Ghat<-colMeans(Ghat3) Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat MIPredIntQ5Dem<-c(MImeanProbDemwin1qnorm(0.975)*sqrt(Varphat),MImeanProbDemwin1+qnorm(0.975)*sqrt(Varphat )) Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MI2012[,1]))) Ghatrep<-colMeans(Ghat3rep) Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep MIPredIntQ5Rep<-c(MImeanProbGopwin1qnorm(0.975)*sqrt(Varphatrep),MImeanProbGopwin1+qnorm(0.975)*sqrt(Varp hatrep)) ## Prediction intervals for Florida locations<-which(subdataset2012[,2]=="FL") sub.FL2012<-subdataset2012[locations,] ModMatQ5<-NULL for (i in 1:dim(sub.FL2012)[1]) { pollerloc2008<-which(subdataset2008[,5]==sub.FL2012[i,5]) PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)] ModMatQ5<rbind(ModMatQ5,c(1,as.numeric(sub.FL2012[i,3:4]),PollersIND)) } Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1)) Ghat2<-ModMatQ5*Ghat1 Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.FL2012[,1]))) Ghat<-colMeans(Ghat3) Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat FLPredIntQ5Dem<-c(FLmeanProbDemwin1qnorm(0.975)*sqrt(Varphat),FLmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat )) Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.FL2012[,1]))) Ghatrep<-colMeans(Ghat3rep) Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep FLPredIntQ5Rep<-c(FLmeanProbGopwin1qnorm(0.975)*sqrt(Varphatrep),FLmeanProbGopwin1+qnorm(0.975)*sqrt(Varp hatrep)) ## Prediction intervals for Missouri locations<-which(subdataset2012[,2]=="MO") sub.MO2012<-subdataset2012[locations,] ModMatQ5<-NULL 21 for (i in 1:dim(sub.MO2012)[1]) { pollerloc2008<-which(subdataset2008[,5]==sub.MO2012[i,5]) PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)] 
ModMatQ5<rbind(ModMatQ5,c(1,as.numeric(sub.MO2012[i,3:4]),PollersIND)) } Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1)) Ghat2<-ModMatQ5*Ghat1 Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MO2012[,1]))) Ghat<-colMeans(Ghat3) Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat MOPredIntQ5Dem<-c(MOmeanProbDemwin1qnorm(0.975)*sqrt(Varphat),MOmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat )) Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MO2012[,1]))) Ghatrep<-colMeans(Ghat3rep) Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep MOPredIntQ5Rep<-c(MOmeanProbGopwin1qnorm(0.975)*sqrt(Varphatrep),MOmeanProbGopwin1+qnorm(0.975)*sqrt(Varp hatrep)) ## Prediction intervals for Colorado locations<-which(subdataset2012[,2]=="CO") sub.CO2012<-subdataset2012[locations,] ModMatQ5<-NULL for (i in 1:dim(sub.CO2012)[1]) { pollerloc2008<-which(subdataset2008[,5]==sub.CO2012[i,5]) PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)] ModMatQ5<rbind(ModMatQ5,c(1,as.numeric(sub.CO2012[i,3:4]),PollersIND)) } Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1)) Ghat2<-ModMatQ5*Ghat1 Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.CO2012[,1]))) Ghat<-colMeans(Ghat3) Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat COPredIntQ5Dem<-c(COmeanProbDemwin1qnorm(0.975)*sqrt(Varphat),COmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat )) Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.CO2012[,1]))) Ghatrep<-colMeans(Ghat3rep) Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep COPredIntQ5Rep<-c(COmeanProbGopwin1qnorm(0.975)*sqrt(Varphatrep),COmeanProbGopwin1+qnorm(0.975)*sqrt(Varp hatrep)) 22 The following table summarizes the prediction intervals for all the states (MI, FL, MO and CO) based on model Q5. MI FL MO CO Dem 0.842 0.577 0.289 0.522 95% Prediction Interval (0.781, 0.902) (0.545, 0.611) (0.253, 0.326) (0.499, 0.545) Rep 0.157 0.422 0.711 0.478 95% Prediction Interval (0.097, 0.218) (0.389, 0.455) (0.674, 0.747) (0.454, 0.500) Based on the above table, similar to the prediction interval table for Q4, we can see that the prediction intervals for MI, FL, MO do not include 0.5, But the prediction intervals for CO include 0.5. This suggests that the predictions for MI, FL and MO made in Q7 are reliable but the prediction for CO is not very reliable. However, comparing the prediction intervals for CO given by the model Q4, we found that the lengths of prediction intervals using Q5 are narrower. This might suggest that the prediction using model Q5 is more reliable than that given by model Q4. Q9. Finally, implement the Silver’s approach to the data sets created in Q3 and Q6 to predict the winners for states considered in Q6 (namely, FL, MI, MO and CO). Please compare the accuracy of the predictions using Silver’s approach and our approach. 
Q9. Finally, implement Silver's approach on the data sets created in Q3 and Q6 to predict the winners of the states considered in Q6 (namely, FL, MI, MO and CO). Please compare the accuracy of the predictions made by Silver's approach and by our approach.

Answer: Following the algorithm described at the beginning of the lab, we implement it step by step with the following R code.

## Step 1: Compute the average error of each pollster
statedgesbypolls<-polls2008sub[,2]-polls2008sub[,3]
subSEbypolls<-statedgesbypolls[dataset2008[,2]%in%stateslist]
subSEwithstates<-cbind(subdataset2008[,c(2,5)],subSEbypolls)
trueSE<-results2008[,2]-results2008[,3]
trueSEexpand<-rep(0,length(subSEbypolls))
for (i in 1:length(subSEbypolls)) {
loc<-which(results2008[,1]==subSEwithstates[i,1])
trueSEexpand[i]<-trueSE[loc]
}
Errors<-abs(subSEbypolls-trueSEexpand) ## absolute error of each poll's edge
subSEs<-cbind(subSEwithstates,trueSEexpand,Errors)
Errorbypollers<-tapply(Errors,subSEs[,2],mean)

## Step 2: Compute the weights
rankPollers<-rank(Errorbypollers)
weights<-1/(rankPollers^2)

## Step 3: Compute the weighted averages
poll2012<-polls2012sub[dataset2012[,2]%in%stateslist,]
poll2012weights<-rep(0,dim(poll2012)[1])
for (i in 1:length(poll2012weights)) {
locwei<-which(names(weights)==poll2012[i,5])
poll2012weights[i]<-weights[locwei]
}
DemPollsWei<-poll2012[,2]*poll2012weights
RepPollsWei<-poll2012[,3]*poll2012weights
DemAveNA<-tapply(DemPollsWei,poll2012[,1],sum)/tapply(poll2012weights,poll2012[,1],sum)
DemAve<-DemAveNA[!is.na(DemAveNA)]
RepAveNA<-tapply(RepPollsWei,poll2012[,1],sum)/tapply(poll2012weights,poll2012[,1],sum)
RepAve<-RepAveNA[!is.na(RepAveNA)]

The weighted averages for the four states ("MI", "FL", "MO" and "CO") computed by Silver's approach are given below.

      Dem        Rep        Prediction
CO    47.72712   47.06949   Dem
FL    47.08507   46.62246   Dem
MI    49.35732   41.43978   Dem
MO    42.89962   50.45573   Rep

Based on these weighted averages, Silver's approach predicts all four states correctly. Its accuracy is therefore the same as that of our method based on the model in Q5, and better than that of the method based on the model in Q4.
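Finally, the predictions can also be checked programmatically against the actual outcomes. This is a minimal sketch, assuming that "2012-results.csv" has the same column layout as "2008-results.csv" (state abbreviation, Dem percentage, Rep percentage).

results2012<-read.csv(file="2012-results.csv",header=TRUE)
states4<-c("MI","FL","MO","CO")
loc<-match(states4,results2012[,1])
actualDemwin<-(results2012[loc,2]-results2012[loc,3]>0)+0 ## 1 if Dem won the state
names(actualDemwin)<-states4
actualDemwin ## per the tables above, this should give MI=1, FL=1, MO=0, CO=1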