General Linear Models in R Stat 430

Outline
• Model Comparison and Selection
• Interaction Effects
• Diagnostics
• Logistic Regression
Flights Project
• Accuracy is measured as the mean squared error (lower is better):
$$\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$
• What YOU should do: divide the data set into a 90% training set and a 10% test set
• Fit the model on the training set, then compute the accuracy of your model on the test set.
• Compute accuracy on test data.
How do we do that in R?
• first: generate random sample using
function sample
• second: split data in training and test data
Random samples in R
> help(sample)
> # sample generates a random sample of size 'size' from the values 1 through n
> sample(10,4)
[1] 4 6 2 3
> sample(10,4)
[1] 2 4 3 1
> # repeated execution will give different values
>
> # in order to get the exact same random sequence, we need to set the random seed
> set.seed(12345)
> sample(10,4)
[1] 8 10 7 9
>
> set.seed(12345)
> sample(10,4)
[1] 8 10 7 9
Test and Training Data
> # in order to get reproducible results it's generally a good idea to set the random seed
> # sample 90% of the flights data
> set.seed(987654)
> idx <- sample(nrow(flights), 0.9*nrow(flights))
> train <- flights[idx,]
> test <- flights[-idx,]
Your Turn
• load the flights data from
"http://www.public.iastate.edu/~hofmann/stat430/project/flights-train.csv"
• split the data into test and training data
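The split described in this exercise can be sketched as follows. This is a minimal sketch, assuming a small simulated data frame as a stand-in for the flights data (the simulated values are hypothetical); with network access, `read.csv` on the URL above would presumably take its place.

```r
# Hypothetical stand-in for the flights data. With network access one could try:
# flights <- read.csv("http://www.public.iastate.edu/~hofmann/stat430/project/flights-train.csv")
set.seed(430)
flights <- data.frame(
  ArrDelay = rnorm(200, mean = 5, sd = 15),
  Distance = runif(200, min = 100, max = 4000)
)

# Draw 90% of the row indices at random (without replacement)
set.seed(987654)
n_train <- floor(0.9 * nrow(flights))
idx <- sample(nrow(flights), n_train)

train <- flights[idx, ]   # rows that were sampled
test  <- flights[-idx, ]  # all remaining rows

nrow(train)  # 180
nrow(test)   # 20
```

Using `floor` makes the training-set size an explicit integer instead of relying on `sample` to truncate a fractional size.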
Model Fitting
• lm fits a linear model
• fit a linear model with dependent variable
ArrDelay and independent variable Distance.
How does Distance affect arrival delays?
Use summary to investigate
• investigate the resulting object using str:
how can we get a vector of coefficients out
of the model object?
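One way to work through these steps is sketched below. The data are simulated and purely hypothetical (the true slope of -0.002 is an arbitrary choice), so the numbers say nothing about the real flights data; only the function calls carry over.

```r
# Simulated training data (assumption: stands in for the real 'train' data frame)
set.seed(1)
train <- data.frame(Distance = runif(100, min = 100, max = 4000))
train$ArrDelay <- 10 - 0.002 * train$Distance + rnorm(100, sd = 15)

# Fit the simple linear model ArrDelay = b0 + b1 * Distance + error
model.dist <- lm(ArrDelay ~ Distance, data = train)

# Coefficient table, residual standard error, R^2, ...
summary(model.dist)

# Top-level structure of the fitted object (a list with class "lm")
str(model.dist, max.level = 1)

# Two equivalent ways to extract the coefficient vector
coef(model.dist)
model.dist$coefficients
```

The extractor `coef` is usually preferable to `$coefficients`, since it also works for other model classes.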
Estimates and Intervals
• coef, vcov, confint, fitted, resid, and rstudent
are functions important for working with (linear) models.
Use help(command) to find out what each one
of these functions does; use them to answer the
questions below.
• Is Distance a variable that significantly contributes
to the model (i.e. is the coefficient of Distance
significantly different from 0)? Give a 90%
confidence interval for the effect of Distance on
arrival delays.
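A possible approach to the significance question and the 90% interval is sketched here, again on simulated (hypothetical) data. It compares `confint` with the interval built by hand from `coef`, `vcov`, and the t quantile.

```r
# Simulated data (assumption: stands in for the flights training set)
set.seed(1)
train <- data.frame(Distance = runif(100, min = 100, max = 4000))
train$ArrDelay <- 10 - 0.002 * train$Distance + rnorm(100, sd = 15)
model.dist <- lm(ArrDelay ~ Distance, data = train)

# Estimate, standard error, t value and p-value for each coefficient;
# the p-value in the Distance row tests H0: slope = 0
coef(summary(model.dist))

# 90% confidence interval for the Distance coefficient
ci90 <- confint(model.dist, "Distance", level = 0.90)
ci90

# The same interval by hand: estimate +/- t quantile * standard error
est <- coef(model.dist)["Distance"]
se  <- sqrt(diag(vcov(model.dist)))["Distance"]
tq  <- qt(0.95, df = df.residual(model.dist))
manual <- c(est - tq * se, est + tq * se)
manual
```

If the 90% interval excludes 0, the slope differs significantly from 0 at the 10% level.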
Estimates and Intervals
• coef, vcov, confint, fitted, and resid
are functions important for working with
(linear) models.
Use help(command) to find out what each
one of these functions does; use them to
answer the questions below.
• Do fitted values and residuals show a two-dimensional pattern? Draw a plot of residuals
(on y) versus fitted values (on x).
Use qplot.
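The residual plot asked for here might look like the sketch below. The slide suggests qplot (from the ggplot2 package); this sketch uses base graphics instead, so that it runs without extra packages, and it relies on simulated (hypothetical) data.

```r
# Simulated data (assumption: stands in for the flights training set)
set.seed(1)
train <- data.frame(Distance = runif(100, min = 100, max = 4000))
train$ArrDelay <- 10 - 0.002 * train$Distance + rnorm(100, sd = 15)
model.dist <- lm(ArrDelay ~ Distance, data = train)

fit <- fitted(model.dist)  # one fitted value per training row
res <- resid(model.dist)   # observed minus fitted

# Residuals (y) versus fitted values (x), with a reference line at zero
plot(fit, res, xlab = "fitted(model.dist)", ylab = "resid(model.dist)")
abline(h = 0, lty = 2)

# With ggplot2 installed, the slide's suggestion would be along the lines of:
# ggplot2::qplot(fit, res)
```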
Residual Plots:
fitted vs residuals
• under model assumptions fitted values are
independent of residuals: we should not
see trends or patterns
• residuals should have the same error variance:
we should see a “band” around zero of
the same height across the fitted values
• only about 5% of standardized residuals (e.g. from rstudent)
should be above 2 or below -2
Residual Plots:
explanatory vs residuals
• under model assumptions X values are
independent of residuals: we should not
see trends or patterns
• residuals should have the same error variance:
we should see a “band” around zero of
the same height across X
• only about 5% of standardized residuals (e.g. from rstudent)
should be above 2 or below -2
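The checks on these two slides can be sketched on simulated (hypothetical) data as follows. The ±2 rule is applied to studentized residuals from `rstudent`, since raw residuals are on the scale of the response.

```r
# Simulated data (assumption: stands in for the flights training set)
set.seed(1)
train <- data.frame(Distance = runif(100, min = 100, max = 4000))
train$ArrDelay <- 10 - 0.002 * train$Distance + rnorm(100, sd = 15)
model.dist <- lm(ArrDelay ~ Distance, data = train)

# Studentized residuals are roughly on a standard scale
rs <- rstudent(model.dist)

# Residuals against the explanatory variable, with the +/- 2 band marked
plot(train$Distance, rs, xlab = "Distance", ylab = "rstudent(model.dist)")
abline(h = c(-2, 0, 2), lty = 2)

# Share of residuals outside the band; roughly 0.05 is expected
prop_outside <- mean(abs(rs) > 2)
prop_outside
```

A share far above 5%, or points outside the band clustered at one end of the x axis, would hint at outliers or non-constant variance.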
Model Predictions
• predict returns predicted values from a model
for a new dataset:
> help(predict)
> testpred <- predict(model.dist, newdata=test)
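To make the role of `newdata` concrete, the sketch below fits on a training set and predicts for held-out rows. The data are simulated and hypothetical; the names `model.dist`, `train`, and `test` follow the slides.

```r
# Simulated data and a 90/10 split (assumption: stands in for the flights data)
set.seed(430)
flights <- data.frame(Distance = runif(200, min = 100, max = 4000))
flights$ArrDelay <- 10 - 0.002 * flights$Distance + rnorm(200, sd = 15)
idx   <- sample(nrow(flights), floor(0.9 * nrow(flights)))
train <- flights[idx, ]
test  <- flights[-idx, ]

model.dist <- lm(ArrDelay ~ Distance, data = train)

# One prediction per row of 'test'; newdata must contain column 'Distance'
testpred <- predict(model.dist, newdata = test)

# For this model the prediction is intercept + slope * Distance
b <- coef(model.dist)
by_hand <- b["(Intercept)"] + b["Distance"] * test$Distance

head(cbind(observed = test$ArrDelay, predicted = testpred))
```

Without `newdata`, `predict` returns the fitted values for the training data.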
Mean Squared Error
• The MSE is computed as
$$\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$
• In R we can implement this as a function mse:
> mse <- function(y, yhat) {
+ n <- length(y)
+ return(sum((y-yhat)^2)/n)
+ }
>
> mse(testpred, test$ArrDelay)
[1] 213.4085
Extending the Model
• Include DayOfWeek in your model - how
many more coefficients do you have in your
model?
If you have only one more, try to use
factor.
• Is the extended model an improvement?
Get predictions for the test data set, and
compare MSE.
• What does anova do?
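The comparison asked for on this slide could be sketched as below. The data are simulated, and it is an assumption of the sketch that DayOfWeek is stored as the integers 1 to 7; the effect sizes are arbitrary.

```r
# Simulated data (assumption: DayOfWeek coded as integers 1..7)
set.seed(430)
n <- 500
flights <- data.frame(
  Distance  = runif(n, min = 100, max = 4000),
  DayOfWeek = sample(1:7, n, replace = TRUE)
)
flights$ArrDelay <- 5 - 0.002 * flights$Distance +
  4 * (flights$DayOfWeek == 5) + rnorm(n, sd = 15)

idx   <- sample(n, floor(0.9 * n))
train <- flights[idx, ]
test  <- flights[-idx, ]

mse <- function(y, yhat) mean((y - yhat)^2)

model.dist <- lm(ArrDelay ~ Distance, data = train)
# Numeric DayOfWeek: a single extra slope (days treated as a linear trend)
model.num  <- lm(ArrDelay ~ Distance + DayOfWeek, data = train)
# factor(DayOfWeek): one coefficient per day except the baseline day
model.day  <- lm(ArrDelay ~ Distance + factor(DayOfWeek), data = train)

length(coef(model.dist))  # 2
length(coef(model.num))   # 3
length(coef(model.day))   # 8

# F test of the nested models: do the day effects reduce the residual
# sum of squares by more than chance would suggest?
anova(model.dist, model.day)

# Out-of-sample comparison on the held-out test set
mse(test$ArrDelay, predict(model.dist, newdata = test))
mse(test$ArrDelay, predict(model.day,  newdata = test))
```

A significant F test and a lower test MSE need not coincide: the first measures in-sample fit, the second out-of-sample prediction.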
Interaction Effects
• For three variables X, Y, Z an interaction effect is
present if the association between X and Y is
different for different levels of Z
• interaction present: groups have different slopes
no interaction: same slopes
[Figure: two panels plotting fitted(model3) and fitted(model2) against Distance, with one line per DayOfWeek (1 to 7).]
Interactions in R
• formula coding of interactions: * or :
• * includes all lower dimensional
interactions, too (preferable)
• : includes only the interaction term specified
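The difference between the two operators can be illustrated with a small sketch on simulated (hypothetical) data, with DayOfWeek assumed to be coded 1 to 7. The model names `m_add`, `m_star`, `m_full`, and `m_colon` are made up for this example.

```r
# Simulated data (assumption: DayOfWeek coded as integers 1..7)
set.seed(430)
n <- 500
flights <- data.frame(
  Distance  = runif(n, min = 100, max = 4000),
  DayOfWeek = sample(1:7, n, replace = TRUE)
)
flights$ArrDelay <- 5 - 0.002 * flights$Distance + rnorm(n, sd = 15)

# Additive model: same Distance slope for every day
m_add  <- lm(ArrDelay ~ Distance + factor(DayOfWeek), data = flights)

# '*' expands to both main effects plus their interaction
m_star <- lm(ArrDelay ~ Distance * factor(DayOfWeek), data = flights)
m_full <- lm(ArrDelay ~ Distance + factor(DayOfWeek) +
               Distance:factor(DayOfWeek), data = flights)

# ':' alone keeps only the interaction term (no main effects)
m_colon <- lm(ArrDelay ~ Distance:factor(DayOfWeek), data = flights)

length(coef(m_star))  # 14: intercept, slope, 6 day shifts, 6 slope changes
names(coef(m_colon))

# Does letting the slope vary by day improve on the additive model?
anova(m_add, m_star)
```

Here `m_star` and `m_full` are the same model written two ways, which is why `*` is the convenient default.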
Interactions
• How do Events affect arrival delays?
• Would you assume that the interaction of
Events and DayOfWeek is significant?
• Again, assess the extended model(s) both in
terms of anova and prediction gains.
Problematic residual plots
[Figure: two panels plotting resid(m) against fitted(m).]
Your Turn
• Work on modeling arrival delays.