Lab 3: Logistic regression models

In this lab, we will apply logistic regression models to United States (US)
presidential election data sets. The main purpose is to predict the outcome of
the presidential election in each state based on election polls and historical
election results. We will use the election poll data sets collected in 2008 and
2012, together with the true election outcomes of 2008, to predict the election
outcomes of 2012. It might be more interesting to predict the outcomes of the
2016 presidential election; however, due to limited data availability and the
current stage of the election, we will not consider the 2016 presidential
election in today's lab.
The US presidential election is held every four years on the Tuesday after the
first Monday in November. The 2016 presidential election is scheduled for Nov 8,
2016. The 2008 and 2012 elections were held, respectively, on Nov 4, 2008 and
Nov 6, 2012. The President of the US is not elected directly by popular vote.
Instead, the President is elected by electors who are selected by popular vote
on a state-by-state basis. These selected electors cast direct votes for the
President. In almost all the states (Maine and Nebraska are the exceptions),
electors are selected on a "winner-take-all" basis. That is, all of a state's
electoral votes go to the presidential candidate who wins the most popular
votes in that state. For simplicity, we will assume all the states use the
"winner-take-all" principle in this lab. The number of electors in each state
equals the number of members of Congress from that state. Currently, there are
a total of 538 electors, corresponding to the 435 House representatives and 100
senators, plus 3 electors from the District of Columbia. A presidential
candidate who receives an absolute majority of electoral votes (no less than
270) is elected President.
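As a toy illustration of this winner-take-all tally, the short R sketch below sums electoral votes over states from a hypothetical vector of state winners (the states and winners here are placeholders, not data from this lab):

evotes<-c(CA=55, TX=38, NY=29)            ## electoral votes per state (toy)
winners<-c(CA="Dem", TX="Rep", NY="Dem")  ## hypothetical state winners
tapply(evotes, winners, sum)              ## Dem: 84, Rep: 38
## a candidate reaching at least 270 of the 538 total votes is elected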
For simplicity, our data analysis only considers the two major political parties:
Democratic (Dem) and Republican (Rep). Our interest is in predicting which party
(Dem or Rep) will win the most votes in each state. Because the chance that a
third-party candidate receives an electoral vote is very small, this
simplification is reasonable.
Predicting the outcomes of presidential election campaigns is of great interest
to many people. In the past, predictions were typically made by political
analysts and pundits based on their personal experience, intuition and
preferences. However, in recent decades, statistical methods have been widely
used in predicting election results. Remarkably, in 2012, statistician Nate
Silver correctly predicted the outcome in every state, after successfully
calling the outcomes in 49 of the 50 states in 2008. In today's lab, we will
compare his method (a simplified version) to our method built on logistic
regression models.
Data sets
The following data sets are available for our data analysis:
1) Polling data from the 2008 US presidential election (2008-polls.csv);
2) Election results from the 2008 US presidential election (2008-results.csv);
3) Polling data from the 2012 US presidential election (2012-polls.csv);
4) Election results from the 2012 US presidential election (2012-results.csv).
The data sets 1) and 2) will be used for training purposes; that is, they will
be used to build the logistic regression models. The data set 3) will be used
for prediction. The data set 4) is provided for validation, which helps us
check whether our predictions are correct.
Both polling data sets 1) and 3) contain five columns. The first column is the
State Abbreviation (SA). The second and third columns are, respectively, the
percentages of votes for the Democratic and Republican candidates. The fourth
column contains the dates on which the polls were conducted. The last column
contains the names of the pollster institutions.
Election polls
Our prediction will be based on election polls. An election poll is a survey
that asks a small sample of voters about their voting plans. If the survey is
conducted appropriately, the sampled voters should be representative of the
voting population at large. However, it is very challenging to obtain a good
representative group, because a good sampling strategy needs to consider many
factors (e.g., sampling time, locations, methods). Therefore, a single poll's
prediction could be biased, and the prediction accuracy could be improved by
combining multiple polls.
There are many possible factors affecting the prediction accuracy of election
polls. Based on the available data sets, we consider the following three.
1. Sampling time. If a poll is conducted far ahead of the election date, its
accuracy is likely worse than that of polls conducted closer to the election
date. Many events can change voters' opinions about the presidential
candidates, so the longer the time before the election, the more likely
voters are to change their voting plans.
2. Pollsters. Systematic biases can occur if a flawed sampling method is used.
For example, if a pollster only collects samples through the Internet, the
sample is biased because it only includes people who have access to the
Internet. Each pollster uses a different method for sampling voters, and
some sampling schemes are better than others. Therefore, it is very likely
that some pollsters' predictions are more reliable than others', and we
should not give equal weight to every poll.
3. State edges. The state edge is the difference between the Democratic and
Republican popular-vote percentages (based on the polls) in that state. For
instance, if the Democratic candidate receives 55% of the votes and the
Republican candidate receives 45%, the Democratic edge is 10 percentage
points (see the short sketch after this list). Because of sampling errors,
when the state edge is small, the prediction of a poll is more likely to be
affected by those errors; when the state edge is large, the prediction
accuracy is less likely to be affected by sampling errors.
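The state edge is simple arithmetic; a quick sketch with toy percentages:

demPct<-c(55, 48); repPct<-c(45, 50)  ## toy Dem/Rep percentages for two polls
demPct-repPct                         ## signed edges: 10 and -2 points
abs(demPct-repPct)                    ## magnitudes, used later as "margins"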
Silver's approach
Nate Silver's algorithm is described in detail at the FiveThirtyEight blog
(http://fivethirtyeight.blogs.nytimes.com/methodology/?_r=0). The key idea of
his algorithm is to smooth (average) different polls' results using a weighted
average. Silver's algorithm weights each pollster according to its prediction
accuracy in previous elections: more biased pollsters receive less weight.
In the following, we briefly describe the general structure of Silver's
algorithm; a toy R sketch follows the list.
1. Calculate the average error of each pollster's predictions in previous
elections. This is known as the pollster's rank: a smaller rank indicates a
more accurate pollster.
2. Transform each rank into a weight. In this lab, we simply set the weight to
one over the square of the rank. Silver's algorithm considers a number of
additional factors in computing a weight, but we lack that information in
the available data sets.
3. For each state, compute a weighted average of the predictions made by the
pollsters. This weighted average predicts the winner in that state.
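Here is a toy R sketch of these three steps; the average errors and poll percentages below are made-up placeholders for the real quantities computed in Q9.

avgError<-c(A=2.1, B=4.5, C=3.0)  ## step 1: average error per pollster (toy)
w<-1/rank(avgError)^2             ## step 2: weight = 1/rank^2
demPct<-c(51, 47, 49)             ## one poll per pollster in some state
sum(w*demPct)/sum(w)              ## step 3: weighted average Dem percentage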
In this lab, we will compare our method based on the logistic regression models
with Silver’s approach in predicting the presidential election winner in each state.
To this end, please answer the following questions.
Q1. Read the data sets “2008-polls.csv”, “2012-polls.csv” and “2008-results.csv”
into R.
To simplify our data analysis, let us focus on subsets of the available data
sets. We will subset based on pollsters, because not all pollsters conducted
polls in every state. Using R, find the pollsters that conducted at least five
polls in both the 2008 and 2012 polling data sets 1) and 3). Then create
subsets of the 2008 and 2012 polling data sets containing only the polls
collected by those selected pollsters.
Answer: To read the data sets into R, we use the following R code
setwd("…") ## Change the directory where you saved the data sets
polls2008<-read.csv(file="2008-polls.csv",header=TRUE)
polls2012<-read.csv(file="2012-polls.csv",header=TRUE)
results2008<-read.csv(file="2008-results.csv",header=TRUE)
Because the data sets are stored in csv files, we use read.csv to read them. To
select the pollsters that conducted at least five polls, we first create
frequency tables of the pollsters in the 2008 and 2012 polling data sets. Then
we find the pollsters that conducted at least five polls in both years. The
following R code selects the desired pollsters.
pollsters20085<-table(polls2008$Pollster)[table(polls2008$Pollster)>=5]
pollsters20125<-table(polls2012$Pollster)[table(polls2012$Pollster)>=5]
subset1<-names(pollsters20085)[names(pollsters20085)%in%names(pollsters20125)]
pollers<-names(pollsters20125)[names(pollsters20125)%in%subset1]
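The two %in% steps above amount to taking a set intersection, so an equivalent (up to ordering) one-liner using the same objects is:

pollers<-intersect(names(pollsters20085),names(pollsters20125))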
Finally, we create the subsets of the 2008 and 2012 data sets collected by the
selected pollsters using the following R code
subsamplesID2008<-polls2008[,5]%in%pollers
polls2008sub<-polls2008[subsamplesID2008,]
subsamplesID2012<-polls2012[,5]%in%pollers
polls2012sub<-polls2012[subsamplesID2012,]
Q2. For the purpose of performing logistic regression, we need to define three
new variables using the data sets created in Q1.
First, we define a binary response variable (Resp) indicating whether the
prediction given by each poll is correct. If the prediction is correct, Resp is
1; otherwise it is 0. To check whether the prediction given by each poll is
correct, first find the predicted winner for each state, and then compare it
with the actual winner in the data set "2008-results.csv".
Second, define the state edges according to the definition given above.
Finally, compute the number of days between the sampling time (polling date)
and the 2008 presidential election date (the lag time). The 2008 presidential
election date was Nov 4, 2008.
Combine the variables defined above (Resp, state edge and lag time), the state
names and the pollsters into a new data set.
Answer: We first define the response variable based on the 2008 polling data
and true election results. The following R code could be used for this purpose.
winers2008<-(results2008[,2]-results2008[,3]>0)+0   ## 1 if Dem won the state
StateID2008<-results2008[,1]
Allresponses<-NULL
for (sid in 1:51)
{
polls2008substate<-polls2008sub[polls2008sub$State==StateID2008[sid],]
pollwiners2008state<-(polls2008substate[,2]-polls2008substate[,3]>0)+0
pollwinersIND<-(pollwiners2008state==winers2008[sid])+0   ## Resp = 1 if correct
Allresponses<-c(Allresponses,pollwinersIND)
}
Then we define the state edges and the lag time using the following R code.
Please note how we compute the new variable lag time. Finally, we combine the
newly defined variables into a new data set.
margins<-abs(polls2008sub[,2]-polls2008sub[,3])
lagtime<-rep(0,dim(polls2008sub)[1])
electiondate2008<-c("Nov 04 2008")
for (i in 1:dim(polls2008sub)[1])
{
lagtime[i]<-as.Date(electiondate2008, format="%b %d %Y")-
  as.Date(as.character(polls2008sub[i,4]), format="%b %d %Y")
}
dataset2008<-cbind(Allresponses,as.character(polls2008sub[,1]),margins,lagtime,
  as.character(polls2008sub[,5]))
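As a side note, as.Date() and date subtraction are vectorized, so the loop above could be replaced by a single assignment; a sketch using the same objects:

lagtime<-as.numeric(as.Date(electiondate2008, format="%b %d %Y")-
  as.Date(as.character(polls2008sub[,4]), format="%b %d %Y"))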
Q3. In the data set created in Q2, you might find that the responses (Resp) of
some states are all equal to 1. For these states, the prediction is relatively easy.
Therefore, we will focus on the states that are relatively difficult to predict. Please
select the states whose responses (Resp) contain at least one 0. Then find the
corresponding subsets of the polling data sets for those selected states.
Answer: We find the states which have at least one 0 and put them into a list.
Then we select the corresponding subset. We use the following R code
stateslist<-unique(dataset2008[which(dataset2008[,1]=="0"),2])
subdataset2008<-dataset2008[dataset2008[,2]%in%stateslist,]
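To see which states were selected, one can tabulate the responses by state; a quick check using the matrix built in Q2:

table(dataset2008[,2],dataset2008[,1])  ## rows: states, columns: Resp = 0/1
stateslist                              ## the states with at least one 0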
Q4. Now we fit a logistic regression model using the data set created in Q3. In
the model, use Resp as the binary response variable, and use SA and Pollster as
categorical predictors, together with the other two predictors defined in Q2:
lag time and state edge. Based on the fitted model, which predictors are
significantly associated with Resp? Please also conduct a hypothesis test to
examine whether the categorical variable SA is significant.
Answer: To fit the logistic regression model in R, we first define the
following variables based on the data set created in Q3. Since we treat SA and
Pollster as categorical predictors, we define them as factors in the logistic
regression model. To this end, we use the following R code,
resp<-as.integer(subdataset2008[,1])
statesFAC<-as.factor(subdataset2008[,2])
margins<-as.double(subdataset2008[,3])
lagtime<-as.double(subdataset2008[,4])
pollersFAC<-as.factor(subdataset2008[,5])
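As a quick check of the baseline levels referenced in the output below (R orders factor levels alphabetically):

levels(statesFAC)[1]   ## "CO", the baseline state
levels(pollersFAC)[1]  ## "ARG", the baseline pollster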
Then we fit a logistic regression model using SA, state edges, lag time and the
pollsters as predictors. The following R code is used.
logitreg<-glm(resp~statesFAC+margins+lagtime+pollersFAC,family="binomial")
summary(logitreg)
The output of the logistic regression model is as follows:
Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                  0.477331   0.602829   0.792 0.428466
statesFACFL                 -1.647375   0.547285  -3.010 0.002612 **
statesFACGA                  1.354619   1.157192   1.171 0.241756
statesFACIN                 -3.359969   0.926903  -3.625 0.000289 ***
statesFACMA                  2.064918   1.233361   1.674 0.094087 .
statesFACMI                 -0.109506   0.733656  -0.149 0.881348
statesFACMN                  1.576421   0.909859   1.733 0.083167 .
statesFACMO                 -0.427149   0.614079  -0.696 0.486683
statesFACMT                  1.572071   1.165776   1.349 0.177491
statesFACNC                 -2.289227   0.641511  -3.568 0.000359 ***
statesFACND                  0.582515   1.411664   0.413 0.679867
statesFACNH                  0.608812   0.770412   0.790 0.429386
statesFACNJ                  0.562342   0.953698   0.590 0.555429
statesFACNM                  0.115791   0.722887   0.160 0.872741
statesFACNV                 -0.782439   0.620767  -1.260 0.207511
statesFACNY                  1.106166   1.220608   0.906 0.364808
statesFACOH                 -1.456813   0.554890  -2.625 0.008655 **
statesFACOR                  2.466634   1.227231   2.010 0.044440 *
statesFACPA                  0.999567   0.706504   1.415 0.157125
statesFACVA                 -0.764514   0.578862  -1.321 0.186595
statesFACWA                  2.049390   1.222229   1.677 0.093589 .
statesFACWI                  1.724056   0.952639   1.810 0.070332 .
statesFACWV                  0.176470   1.192351   0.148 0.882341
margins                      0.243394   0.038387   6.341 2.29e-10 ***
lagtime                     -0.010550   0.001722  -6.128 8.89e-10 ***
pollersFACEPICMRA            1.884727   1.341388   1.405 0.160004
pollersFACInsiderAdvantage   0.831820   0.586503   1.418 0.156112
pollersFACMaristColl         1.899700   1.201596   1.581 0.113883
pollersFACMasonDixon         0.368782   0.590033   0.625 0.531958
pollersFACMuhlenbergColl    -0.107470   1.516623  -0.071 0.943508
pollersFACQuinnipiacU        1.742448   0.629726   2.767 0.005658 **
pollersFACRasmussen          0.273553   0.451894   0.605 0.544948
pollersFACSienaColl         15.026258 542.747543   0.028 0.977913
pollersFACSuffolkU           1.166058   0.920064   1.267 0.205024
pollersFACSurveyUSA          0.831435   0.518039   1.605 0.108501
pollersFACUofCincinnati      0.399582   1.113652   0.359 0.719742
pollersFACUofNewHampshire   -1.361725   1.333940  -1.021 0.307335
pollersFACZogby              0.501113   0.745531   0.672 0.501484
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that in the above R output, SA and Pollster are treated as categorical
variables. The baseline level for SA is "CO" (the state of Colorado), and the
baseline level for Pollster is "ARG". Based on the output, the state edges
(margins) and the lag time are statistically significant in affecting the
success probability of the response (the prediction accuracy). Among the levels
of the categorical variable SA, the states "FL", "IN", "NC", "OH" and "OR" are
statistically significant at the nominal level 0.05. This means that these five
states differ from the baseline state "CO" in terms of prediction accuracy,
holding the other predictors fixed. Among the levels of the categorical
variable Pollster, only "QuinnipiacU" is statistically different from the
baseline pollster "ARG"; since its coefficient is positive, this suggests that
polls conducted by "QuinnipiacU" are more accurate than those of "ARG" in
predicting the election outcome.
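To make the two continuous effects concrete, the fitted coefficients can be converted into odds ratios; a quick computation from the fitted model above:

exp(coef(logitreg)[c("margins","lagtime")])
## exp(0.2434) is about 1.28: each additional point of state edge multiplies
## the odds of a correct call by roughly 1.28
## exp(-0.0106) is about 0.99: each additional day of lag time slightly
## lowers the odds of a correct call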
To check whether the categorical variable "SA" is significant, we use a
likelihood ratio test comparing the model with the predictor "SA" to the model
without it.
logitreg1<-glm(resp~margins+lagtime+pollersFAC,family="binomial")
anova(logitreg1,logitreg, test="Chisq")
The following output gives the analysis of deviance table and the corresponding
p-value for assessing the significance of the variable "SA".
Analysis of Deviance Table

Model 1: resp ~ margins + lagtime + pollersFAC
Model 2: resp ~ statesFAC + margins + lagtime + pollersFAC
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1       647     620.09
2       625     492.68 22   127.41 < 2.2e-16 ***
Because the likelihood ratio test has a p-value much smaller than 0.05, we
conclude that the variable "SA" is significant.
Q5. Refit the logistic regression model in Q4 without the categorical variable
SA. Compare this model with the model fitted in Q4: which one is better?
Answer: Deleting the categorical variable SA, we refit the model using the
following R code:
logitreg1<-glm(resp~margins+lagtime+pollersFAC,family="binomial")
To compare it with the model fitted in Q4, we use AIC and BIC to select an
appropriate model. AIC and BIC are appropriate for model selection because
these information criteria penalize model complexity. The likelihood ratio test
conducted in Q4, by contrast, is a goodness-of-fit comparison: it shows that
the model with "SA" fits the data better than the model without "SA". The
following table presents the AIC and BIC values for both models.
                   AIC        BIC
Model without SA   652.0945   724.0429
Model with SA      568.6820   739.5594
Based on the above table, the model with SA has a smaller AIC value than the
model without SA; therefore, if AIC is used, we should choose the model with
SA. However, if BIC is used, the model without SA is better. The conclusions
given by AIC and BIC differ because AIC puts a smaller penalty on the number of
free parameters than BIC does. This also suggests that the model selected by
AIC is typically bigger (contains more predictors) than the model selected by
BIC.
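The AIC and BIC values in the table can be reproduced with the built-in extractors:

AIC(logitreg1,logitreg)  ## model without SA, model with SA
BIC(logitreg1,logitreg)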
Q6. For prediction, we need to define the new variables, state edges and lag
time, for the 2012 polling data set. The definitions of these new variables are
the same as those described in Q2. For computing the lag time, note that the
2012 presidential election date was Nov 6, 2012. Then create a new data set
containing these two new variables for the polls conducted by the pollsters
selected in Q1 in the states selected in Q3.
Based on the logistic regression models fitted in Q4 and Q5, predict the mean
of the response variable (Resp) for the data set just created. The mean of Resp
is the probability that Resp=1 (the success probability). Please predict the
success probability of each poll for the following states: FL, MI, MO and CO.
Answer: We define the new variables for prediction using the 2012 polling data
set, in the same way as in Q2. The following R code is used:
pollwiners2012<-(polls2012sub[,2]-polls2012sub[,3]>0)+0
margins2012<-abs(polls2012sub[,2]-polls2012sub[,3])
lagtime2012<-rep(0,dim(polls2012sub)[1])
electiondate2012<-c("Nov 06 2012")
for (i in 1:dim(polls2012sub)[1])
{
lagtime2012[i]<-as.Date(electiondate2012, format="%b %d %Y")-
  as.Date(as.character(polls2012sub[i,4]), format="%b %d %Y")
}
dataset2012<-cbind(pollwiners2012,as.character(polls2012sub[,1]),margins2012,
  lagtime2012,as.character(polls2012sub[,5]))
For our analysis, we focus on the states in the list created in Q3.
subdataset2012<-dataset2012[dataset2012[,2]%in%stateslist,]
Using the new variables, it is easy to perform predictions for each poll with
the function "predict" in R. The success probabilities of the polls in MI under
the model in Q4 can be computed with the following R code
margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="MI")
locations<-which(subdataset2012[,2]=="MI")
MIPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
counts<-counts+1
MIdatapoints<-data.frame(statesFAC="MI", margins=margins2012[i],
lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
MIPredictresults[counts,2]<-predict(logitreg, MIdatapoints,
type="response")
}
The predictions for the polls in MI using the model in Q5 are computed
similarly:
NOpolls<-sum(subdataset2012[,2]=="MI")
locations<-which(subdataset2012[,2]=="MI")
MIPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
counts<-counts+1
MIdatapoints<-data.frame(margins=margins2012[i],
lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
MIPredictresults1[counts,2]<-predict(logitreg1, MIdatapoints,
type="response")
}
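As a side note, predict() accepts a data frame with one row per poll, so the loops above can be avoided; a sketch reproducing the Q5 predictions for MI in one call (same objects as above):

MInew<-data.frame(margins=margins2012[locations],
  lagtime=lagtime2012[locations], pollersFAC=pollersFAC2012[locations])
predict(logitreg1, MInew, type="response")
## for the Q4 model, add the state column first: MInew$statesFAC<-"MI"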
In the output below, the first column gives the predicted winner (1 = Dem)
based on each poll. The second column gives the success probabilities predicted
with the model in Q4, and the third column gives those predicted with the model
in Q5.
 [1,] 0 0.7050161 0.7150713
 [2,] 1 0.9836286 0.9829498
 [3,] 1 0.7184638 0.8035997
 [4,] 1 0.8614575 0.8714864
 [5,] 1 0.7653980 0.8954679
 [6,] 1 0.7277984 0.7007603
 [7,] 1 0.9357792 0.9352594
 [8,] 1 0.9741315 0.9605871
 [9,] 1 0.9343612 0.9006867
[10,] 1 0.9041542 0.8784930
[11,] 1 0.7424675 0.7674955
[12,] 1 0.9802705 0.9943562
[13,] 1 0.8880373 0.8321297
[14,] 1 0.9707425 0.9578605
[15,] 1 0.8588295 0.7600386
[16,] 1 0.9554800 0.9480250
Similar methods could be used for predicting the success probabilities of the polls
in the states “FL”, “MO” and “CO”.
The R code for the "FL" state is given below. The first part applies the model
in Q4
margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="FL")
11
locations<-which(subdataset2012[,2]=="FL")
FLPredictresults<cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
counts<-counts+1
FLdatapoints<-data.frame(statesFAC="FL", margins=margins2012[i],
lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
FLPredictresults[counts,2]<-predict(logitreg, FLdatapoints,
type="response")
}
The second part applies the model in Q5
NOpolls<-sum(subdataset2012[,2]=="FL")
locations<-which(subdataset2012[,2]=="FL")
FLPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
counts<-counts+1
FLdatapoints<-data.frame(margins=margins2012[i],
lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
FLPredictresults1[counts,2]<-predict(logitreg1, FLdatapoints,
type="response")
}
The prediction results are given below. The first column gives the predicted
winner (1 = Dem) based on each poll, the second column gives the success
probabilities predicted with the model in Q4, and the third column gives those
predicted with the model in Q5.
 [1,] 0 0.56214658 0.8172779
 [2,] 0 0.14006329 0.5900694
 [3,] 0 0.23539652 0.4892338
 [4,] 0 0.21664557 0.4612514
 [5,] 0 0.13079384 0.4661476
 [6,] 0 0.05549671 0.3457921
 [7,] 0 0.64968573 0.8476961
 [8,] 1 0.53303376 0.7561092
 [9,] 0 0.07828385 0.2883600
[10,] 0 0.06236959 0.2514030
[11,] 0 0.12796709 0.3417972
[12,] 0 0.64710916 0.8262847
[13,] 1 0.51461589 0.7485297
[14,] 1 0.06433675 0.3162441
[15,] 1 0.58318467 0.7768037
[16,] 1 0.14151955 0.3723813
[17,] 1 0.15701236 0.4587324
[18,] 0 0.32868218 0.6065016
[19,] 0 0.52997354 0.7448883
[20,] 1 0.04629752 0.2763060
[21,] 1 0.46001315 0.4928207
[22,] 1 0.64411350 0.8359995
[23,] 1 0.42661942 0.5886155
[24,] 0 0.39076333 0.5313525
[25,] 0 0.44675510 0.4479178
[26,] 0 0.31911545 0.5337634
[27,] 0 0.45085822 0.6782808
[28,] 0 0.70216913 0.8345176
[29,] 1 0.42993929 0.7249804
[30,] 1 0.47746333 0.8461812
[31,] 1 0.50577197 0.7283998
[32,] 1 0.27092546 0.5018527
[33,] 1 0.66321631 0.8422744
[34,] 1 0.67279456 0.7316409
[35,] 1 0.37520930 0.3721240
[36,] 1 0.25649807 0.4712261
[37,] 0 0.36904939 0.5639717
[38,] 1 0.42084846 0.8986539
[39,] 1 0.47564231 0.8111579
[40,] 1 0.79508390 0.9341599
[41,] 1 0.62016858 0.7645409
[42,] 1 0.76200376 0.8903070
[43,] 1 0.39465029 0.7093355
[44,] 1 0.72876273 0.8704343
[45,] 1 0.90964579 0.9562761
The R code for the “MO” state is given in the following. We first apply the model
in Q4.
margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="MO")
locations<-which(subdataset2012[,2]=="MO")
MOPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
counts<-counts+1
MOdatapoints<-data.frame(statesFAC="MO", margins=margins2012[i],
lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
MOPredictresults[counts,2]<-predict(logitreg, MOdatapoints,
type="response")
}
Then we apply the model in Q5 for predicting the success probabilities.
NOpolls<-sum(subdataset2012[,2]=="MO")
locations<-which(subdataset2012[,2]=="MO")
MOPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
counts<-counts+1
MOdatapoints<-data.frame(margins=margins2012[i],
lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
MOPredictresults1[counts,2]<-predict(logitreg1, MOdatapoints,
type="response")
}
The results are summarized below. The first column gives the predicted winner
(1 = Dem) based on each poll, the second column gives the success probabilities
predicted with the model in Q4, and the third column gives those predicted with
the model in Q5.
 [1,] 0 0.5035039 0.7200187
 [2,] 0 0.9694242 0.9710735
 [3,] 0 0.6044276 0.7044485
 [4,] 0 0.8194090 0.8628442
 [5,] 0 0.7910887 0.8080985
 [6,] 0 0.9278236 0.8966974
 [7,] 0 0.9421374 0.9413403
 [8,] 0 0.5542213 0.4922224
 [9,] 0 0.6769271 0.7092202
[10,] 0 0.2520589 0.3617698
[11,] 0 0.6137583 0.5711565
[12,] 0 0.6647826 0.6007542
[13,] 1 0.4416067 0.4014089
The R code and the results for the state “CO” are given below. The first part uses
the complex model in Q4.
margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="CO")
COPredictresults<cbind(as.double(subdataset2012[1:NOpolls,1]),rep(0,NOpolls))
14
for (i in 1:NOpolls)
{
COdatapoints<-data.frame(statesFAC="CO", margins=margins2012[i],
lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
COPredictresults[i,2]<-predict(logitreg, COdatapoints,
type="response")
}
The following part performs the prediction for the state "CO" using the simple
logistic regression model in Q5.
NOpolls<-sum(subdataset2012[,2]=="CO")
COPredictresults1<cbind(as.double(subdataset2012[1:NOpolls,1]),rep(0,NOpolls))
for (i in 1:NOpolls)
{
COdatapoints<-data.frame(margins=margins2012[i],
lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
COPredictresults1[i,2]<-predict(logitreg1, COdatapoints,
type="response")
}
The predicted probabilities are summarized below. The first column gives the
predicted winner (1 = Dem) based on each poll, the second column gives the
success probabilities predicted with the model in Q4, and the third column
gives those predicted with the model in Q5.
 [1,] 0 0.2966687 0.2437847
 [2,] 0 0.6704433 0.5091158
 [3,] 0 0.9217368 0.8403067
 [4,] 1 0.7045756 0.7054296
 [5,] 0 0.7585907 0.6682250
 [6,] 0 0.8257308 0.6908284
 [7,] 1 0.8497009 0.6723346
 [8,] 1 0.7255040 0.5371906
 [9,] 0 0.4452990 0.3148480
[10,] 0 0.8973189 0.7094159
[11,] 0 0.7802836 0.5773134
[12,] 0 0.6515317 0.4903710
[13,] 0 0.8016550 0.6377274
[14,] 1 0.8738779 0.6823733
[15,] 0 0.9037745 0.8141867
[16,] 1 0.5948107 0.4947829
[17,] 1 0.6632452 0.4669786
[18,] 1 0.9559369 0.9366182
[19,] 1 0.9081582 0.9219822
Q7. In this question, we will predict the winner of each state (FL, MI, MO and
CO) using the predictions given in Q6. To be concrete, define the winner
indicator (WIND) as 1 if the Democratic candidate is the winner, and 0
otherwise. From Q6, we know the probability that a poll made a correct
prediction of the winner (i.e., Resp=1). Note that Resp=1 exactly when the WIND
based on the polling data agrees with the WIND based on the actual election
data. We therefore use the average probability that WIND=1 to predict the
probability that Dem wins the election, and the average probability that WIND=0
to predict the probability that Rep wins the election, where the average is
taken across the predicted probabilities from all the pollsters who conducted
polls in that state. Please do the prediction using both the models in Q4 and
Q5. Comparing your predictions with the actual election results in the data
file "2012-results.csv", what are your conclusions about the accuracy of your
predictions?
Answer: The predicted probabilities for MI using the model in Q4 are given by
the following R code
MIprobDemwin<-MIPredictresults[,1]*MIPredictresults[,2]+
  (1-MIPredictresults[,1])*(1-MIPredictresults[,2])
MImeanProbDemwin<-mean(MIprobDemwin)
MIprobGopwin<-(1-MIPredictresults[,1])*MIPredictresults[,2]+
  MIPredictresults[,1]*(1-MIPredictresults[,2])
MImeanProbGopwin<-mean(MIprobGopwin)
The predicted probabilities for MI using the model in Q5 are given by the
following R code
MIprobDemwin1<-MIPredictresults1[,1]*MIPredictresults1[,2]+
  (1-MIPredictresults1[,1])*(1-MIPredictresults1[,2])
MImeanProbDemwin1<-mean(MIprobDemwin1)
MIprobGopwin1<-(1-MIPredictresults1[,1])*MIPredictresults1[,2]+
  MIPredictresults1[,1]*(1-MIPredictresults1[,2])
MImeanProbGopwin1<-mean(MIprobGopwin1)
Similarly, the predicted probabilities for the "FL" state using the model in Q4
are given by
FLprobDemwin<-FLPredictresults[,1]*FLPredictresults[,2]+
  (1-FLPredictresults[,1])*(1-FLPredictresults[,2])
FLmeanProbDemwin<-mean(FLprobDemwin)
FLprobGopwin<-(1-FLPredictresults[,1])*FLPredictresults[,2]+
  FLPredictresults[,1]*(1-FLPredictresults[,2])
FLmeanProbGopwin<-mean(FLprobGopwin)
For the model in Q5, we use the following R code
FLprobDemwin1<-FLPredictresults1[,1]*FLPredictresults1[,2]+
  (1-FLPredictresults1[,1])*(1-FLPredictresults1[,2])
FLmeanProbDemwin1<-mean(FLprobDemwin1)
FLprobGopwin1<-(1-FLPredictresults1[,1])*FLPredictresults1[,2]+
  FLPredictresults1[,1]*(1-FLPredictresults1[,2])
FLmeanProbGopwin1<-mean(FLprobGopwin1)
Since the R code for the states MO and CO is similar to that for the above two
states, we omit the details here. The full R code is included in the file
"R-code-for-lab-3.txt" on the class website.
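Because the same two averages are computed for every state, a small helper function (our own refactoring, not part of the lab's code file) removes the duplication; note that the Rep probability is simply one minus the Dem probability:

winProbs<-function(pred)  ## pred: column 1 = WIND, column 2 = P(Resp=1)
{
probDem<-pred[,1]*pred[,2]+(1-pred[,1])*(1-pred[,2])
c(Dem=mean(probDem), Rep=1-mean(probDem))
}
winProbs(MIPredictresults)   ## model-Q4 averages for MI
winProbs(MIPredictresults1)  ## model-Q5 averages for MI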
We summarize the above prediction results in a table below.
      Model Q4               Model Q5               Actual
      Dem    Rep    Winner   Dem    Rep    Winner   Results
MI    0.843  0.156  Dem      0.842  0.157  Dem      Dem
FL    0.553  0.447  Dem      0.577  0.422  Dem      Dem
MO    0.317  0.683  Rep      0.289  0.711  Rep      Rep
CO    0.491  0.509  Rep      0.522  0.478  Dem      Dem
Based on the above table, the predictions based on the model in Q5 are all
correct, but those based on the model in Q4 are not: the prediction for CO is
incorrect under the model in Q4, while the corresponding prediction is correct
under the model in Q5.
Q8. Please construct the 95% prediction intervals for the average probabilities
predicted in Q7.
Answer: The method for constructing the 95% prediction intervals was introduced
in one of the notes sent through email. The R code for constructing the
prediction intervals for the average probabilities using the model in Q4 is
given below for the four states "MI", "FL", "MO" and "CO".
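As a brief reminder of the idea (the details are in the emailed notes): the average predicted probability is a smooth function of the fitted coefficients, so by the delta method its variance is approximately g'Vg, where g is the gradient of the average predicted probability with respect to the coefficients (evaluated at the estimates) and V = vcov(logitreg) is the estimated covariance matrix of the coefficients. The function Deritive below computes the derivative of the logistic mean exp(x'beta)/(1+exp(x'beta)) with respect to the linear predictor, which is the building block of g, and each interval is the point estimate plus or minus qnorm(0.975) times the resulting standard error.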
Deritive<-function(x,beta)
{
## derivative of exp(x'beta)/(1+exp(x'beta)) w.r.t. the linear predictor
deri0<-exp(x%*%beta)
deri1<-deri0/((1+deri0)^2)
return(deri1)
}
## Prediction intervals for Michigan
locations<-which(subdataset2012[,2]=="MI")
sub.MI2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="MI")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.MI2012)[1])
{
pollerloc2008<-which(subdataset2008[,5]==sub.MI2012[i,5])
PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.MI2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MI2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
MIPredIntQ4Dem<-c(MImeanProbDemwin-qnorm(0.975)*sqrt(Varphat),
  MImeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MI2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
MIPredIntQ4Rep<-c(MImeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),
  MImeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))
## Prediction intervals for Florida
locations<-which(subdataset2012[,2]=="FL")
sub.FL2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="FL")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.FL2012)[1])
{
pollerloc2008<-which(subdataset2008[,5]==sub.FL2012[i,5])
PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.FL2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.FL2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
FLPredIntQ4Dem<-c(FLmeanProbDemwin-qnorm(0.975)*sqrt(Varphat),
  FLmeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.FL2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
FLPredIntQ4Rep<-c(FLmeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),
  FLmeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))
## Prediction intervals for Missouri
locations<-which(subdataset2012[,2]=="MO")
sub.MO2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="MO")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.MO2012)[1])
{
pollerloc2008<-which(subdataset2008[,5]==sub.MO2012[i,5])
PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.MO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
MOPredIntQ4Dem<-c(MOmeanProbDemwin-qnorm(0.975)*sqrt(Varphat),
  MOmeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
MOPredIntQ4Rep<-c(MOmeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),
  MOmeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))
## Prediction intervals for Colorado
locations<-which(subdataset2012[,2]=="CO")
sub.CO2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="CO")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.CO2012)[1])
{
pollerloc2008<-which(subdataset2008[,5]==sub.CO2012[i,5])
PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.CO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.CO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
COPredIntQ4Dem<-c(COmeanProbDemwin-qnorm(0.975)*sqrt(Varphat),
  COmeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.CO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
COPredIntQ4Rep<-c(COmeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),
  COmeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))
The following table summarizes the prediction intervals for all the states (MI, FL,
MO and CO) based on model Q4.
      Dem     95% Prediction Interval   Rep     95% Prediction Interval
MI    0.843   (0.754, 0.933)            0.156   (0.067, 0.246)
FL    0.553   (0.481, 0.624)            0.447   (0.375, 0.519)
MO    0.317   (0.200, 0.434)            0.683   (0.566, 0.800)
CO    0.491   (0.454, 0.527)            0.509   (0.473, 0.546)
Based on the above table, the prediction intervals for MI and MO do not include
0.5, which suggests that the predictions for MI and MO made in Q7 are reliable.
The prediction intervals for CO and FL do include 0.5, which suggests that the
predictions made with the model in Q4 are not very reliable for these two
states.
The R code for constructing the prediction intervals based on the model in Q5
is given below:
## Prediction intervals for Michigan
locations<-which(subdataset2012[,2]=="MI")
sub.MI2012<-subdataset2012[locations,]
ModMatQ5<-NULL
for (i in 1:dim(sub.MI2012)[1])
{
pollerloc2008<-which(subdataset2008[,5]==sub.MI2012[i,5])
PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)]
ModMatQ5<-rbind(ModMatQ5,c(1,as.numeric(sub.MI2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1))
Ghat2<-ModMatQ5*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MI2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat
MIPredIntQ5Dem<-c(MImeanProbDemwin1-qnorm(0.975)*sqrt(Varphat),
  MImeanProbDemwin1+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MI2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep
MIPredIntQ5Rep<-c(MImeanProbGopwin1-qnorm(0.975)*sqrt(Varphatrep),
  MImeanProbGopwin1+qnorm(0.975)*sqrt(Varphatrep))
## Prediction intervals for Florida
locations<-which(subdataset2012[,2]=="FL")
sub.FL2012<-subdataset2012[locations,]
ModMatQ5<-NULL
for (i in 1:dim(sub.FL2012)[1])
{
pollerloc2008<-which(subdataset2008[,5]==sub.FL2012[i,5])
PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)]
ModMatQ5<-rbind(ModMatQ5,c(1,as.numeric(sub.FL2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1))
Ghat2<-ModMatQ5*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.FL2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat
FLPredIntQ5Dem<-c(FLmeanProbDemwin1-qnorm(0.975)*sqrt(Varphat),
  FLmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.FL2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep
FLPredIntQ5Rep<-c(FLmeanProbGopwin1-qnorm(0.975)*sqrt(Varphatrep),
  FLmeanProbGopwin1+qnorm(0.975)*sqrt(Varphatrep))
## Prediction intervals for Missouri
locations<-which(subdataset2012[,2]=="MO")
sub.MO2012<-subdataset2012[locations,]
ModMatQ5<-NULL
for (i in 1:dim(sub.MO2012)[1])
{
pollerloc2008<-which(subdataset2008[,5]==sub.MO2012[i,5])
PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)]
ModMatQ5<-rbind(ModMatQ5,c(1,as.numeric(sub.MO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1))
Ghat2<-ModMatQ5*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat
MOPredIntQ5Dem<-c(MOmeanProbDemwin1-qnorm(0.975)*sqrt(Varphat),
  MOmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep
MOPredIntQ5Rep<-c(MOmeanProbGopwin1-qnorm(0.975)*sqrt(Varphatrep),
  MOmeanProbGopwin1+qnorm(0.975)*sqrt(Varphatrep))
## Prediction intervals for Colorado
locations<-which(subdataset2012[,2]=="CO")
sub.CO2012<-subdataset2012[locations,]
ModMatQ5<-NULL
for (i in 1:dim(sub.CO2012)[1])
{
pollerloc2008<-which(subdataset2008[,5]==sub.CO2012[i,5])
PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)]
ModMatQ5<-rbind(ModMatQ5,c(1,as.numeric(sub.CO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1))
Ghat2<-ModMatQ5*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.CO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat
COPredIntQ5Dem<-c(COmeanProbDemwin1-qnorm(0.975)*sqrt(Varphat),
  COmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.CO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep
COPredIntQ5Rep<-c(COmeanProbGopwin1-qnorm(0.975)*sqrt(Varphatrep),
  COmeanProbGopwin1+qnorm(0.975)*sqrt(Varphatrep))
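The eight nearly identical blocks above differ only in the state, the fitted model and the point estimate; a wrapper such as the sketch below (our own refactoring, not part of the lab's code file) computes one interval from a model matrix ModMat, the poll winners wind, a fitted model and a point estimate pbar:

predInterval<-function(ModMat, wind, fit, pbar)
{
g1<-apply(ModMat, 1, Deritive, beta=coef(fit))  ## derivative per poll
G<-ModMat*g1                                    ## gradient rows
g<-colMeans(G*((-1)^(1+wind)))                  ## gradient of the Dem average
se<-sqrt(t(g)%*%vcov(fit)%*%g)
c(pbar-qnorm(0.975)*se, pbar+qnorm(0.975)*se)
}
## e.g., with ModMatQ5 rebuilt for MI as in the first block above:
## predInterval(ModMatQ5, as.numeric(sub.MI2012[,1]), logitreg1, MImeanProbDemwin1)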
The following table summarizes the prediction intervals for all the states (MI, FL,
MO and CO) based on model Q5.
      Dem     95% Prediction Interval   Rep     95% Prediction Interval
MI    0.842   (0.781, 0.902)            0.157   (0.097, 0.218)
FL    0.577   (0.545, 0.611)            0.422   (0.389, 0.455)
MO    0.289   (0.253, 0.326)            0.711   (0.674, 0.747)
CO    0.522   (0.499, 0.545)            0.478   (0.454, 0.500)
Based on the above table, and similar to the prediction interval table for Q4,
the prediction intervals for MI, FL and MO do not include 0.5, but the
prediction intervals for CO do. This suggests that the predictions for MI, FL
and MO made in Q7 are reliable, whereas the prediction for CO is not very
reliable. However, comparing the prediction intervals for CO with those given
by the model in Q4, we find that the intervals from the model in Q5 are
narrower. This might suggest that the prediction using the model in Q5 is more
reliable than that given by the model in Q4.
Q9. Finally, implement Silver's approach on the data sets created in Q3 and Q6
to predict the winners of the states considered in Q6 (namely, FL, MI, MO and
CO). Please compare the accuracy of the predictions from Silver's approach and
from our approach.
Answer: Following the algorithm described at the beginning of the lab, we
implement the three steps with the following R code:
## Step 1: Compute average errors
statedgesbypolls<-polls2008sub[,2]-polls2008sub[,3]
subSEbypolls<-statedgesbypolls[dataset2008[,2]%in%stateslist]
subSEwithstates<-cbind(subdataset2008[,c(2,5)],subSEbypolls)
trueSE<-results2008[,2]-results2008[,3]
trueSEexpand<-rep(0,length(subSEbypolls))
for (i in 1:length(subSEbypolls))
{
loc<-which(results2008[,1]==subSEwithstates[i,1])
trueSEexpand[i]<-trueSE[loc]
}
Errors<-abs(as.numeric(subSEwithstates[,3])-trueSEexpand)
subSEs<-cbind(subSEwithstates,trueSEexpand,Errors)
Errorbypollers<-tapply(Errors,subSEs[,2],mean)
## Step 2: Compute weights
rankPollers<-rank(Errorbypollers)
weights<-1/(rankPollers^2)
## Step 3: Compute weighted averages
poll2012<-polls2012sub[dataset2012[,2]%in%stateslist,]
poll2012weights<-rep(0,dim(poll2012)[1])
for (i in 1:length(poll2012weights))
{
locwei<-which(names(weights)==poll2012[i,5])
poll2012weights[i]<-weights[locwei]
}
DemPollsWei<-poll2012[,2]*poll2012weights
RepPollsWei<-poll2012[,3]*poll2012weights
DemAveNA<-tapply(DemPollsWei,poll2012[,1],sum)/
  tapply(poll2012weights,poll2012[,1],sum)
DemAve<-DemAveNA[!is.na(DemAveNA)]
RepAveNA<-tapply(RepPollsWei,poll2012[,1],sum)/
  tapply(poll2012weights,poll2012[,1],sum)
RepAve<-RepAveNA[!is.na(RepAveNA)]
The following table gives the weighted averages for the four states ("MI",
"FL", "MO" and "CO") computed by Silver's approach:

             CO         FL         MI         MO
Dem          47.72712   47.08507   49.35732   42.89962
Rep          47.06949   46.62246   41.43978   50.45573
Prediction   Dem        Dem        Dem        Rep
Based on the weighted averages in the above table, Nate Silver's approach
predicts all four results correctly. Therefore, the accuracy of Silver's
approach is the same as that of our method based on the model in Q5, and better
than that based on the model in Q4.