General Linear Models in R
Stat 430

Outline
• Model Comparison and Selection
• Interaction Effects
• Diagnostics
• Logistic Regression

Flights Project
• Accuracy is measured as (lower is better)
  $$\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$
• What YOU should do: divide data set into 90% Training Set and 10% Test Set
• Fit model on training set, compute accuracy of your model on test set.
• Compute accuracy on test data.

How do we do that in R?
• first: generate random sample using function sample
• second: split data in training and test data

Random samples in R
> help(sample)
> # sample is generating a random sample of size 'size' from values 1 through n
> sample(10,4)
[1] 4 6 2 3
> sample(10,4)
[1] 2 4 3 1
> # repeated execution will give different values
>
> # in order to get the exact same random sequence, we need to set the random seed
> set.seed(12345)
> sample(10,4)
[1] 8 10 7 9
>
> set.seed(12345)
> sample(10,4)
[1] 8 10 7 9

Test and Training Data
> # in order to get reproducible results it's generally a good idea to set the random seed
> # sample 90% of the flights data
> set.seed(987654)
> idx <- sample(nrow(flights), 0.9*nrow(flights))
> train <- flights[idx,]
> test <- flights[-idx,]

Your Turn
• load the flights data from "http://www.public.iastate.edu/~hofmann/stat430/project/flights-train.csv"
• split the data into test and training data

Model Fitting
• lm fits a linear model
• fit a linear model with dependent variable ArrDelay and independent variable Distance. How does Distance affect arrival delays? Use summary to investigate
• investigate the resulting object using str
• how can we get a vector of coefficients out of the model object?

Estimates and Intervals
• coef, vcov, confint, fitted, resid, and rstudent are functions important for working with (linear) models.
  Use help(command) to find out what each of these functions does, and use them to answer the questions below.
• Is Distance a variable that significantly contributes to the model (i.e. is the coefficient of Distance significantly different from 0)?
• Give a 90% confidence interval for the effect of Distance on arrival delays.

Estimates and Intervals
• coef, vcov, confint, fitted, and resid are functions important for working with (linear) models. Use help(command) to find out what each of these functions does, and use them to answer the questions below.
• Do fitted values and residuals show a two-dimensional pattern? Draw a plot of residuals (on y) versus fitted values (on x). Use qplot.

Residual Plots: fitted vs residuals
• under model assumptions fitted values are independent of residuals: we should not see trends or patterns
• residuals should have the same error variance: we should see a “band” around zero of the same height across the fitted values
• only about 5% of standardized residuals (e.g. from rstudent) should be above 2 or below -2

Residual Plots: explanatory vs residuals
• under model assumptions X values are independent of residuals: we should not see trends or patterns
• residuals should have the same error variance: we should see a “band” around zero of the same height across X
• only about 5% of standardized residuals (e.g. from rstudent) should be above 2 or below -2

Model Predictions
• predict returns predicted values from a model for a new dataset:
> help(predict)
> testpred <- predict(model.dist, newdata=test)

Mean Squared Error
• The MSE is computed as
  $$\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$
• In R we can implement this as a function mse:
> mse <- function(y, yhat) {
+   n <- length(y)
+   return(sum((y-yhat)^2)/n)
+ }
> mse(testpred, test$ArrDelay)
[1] 213.4085

Extending the Model
• Include DayOfWeek in your model. How many more coefficients do you have in your model?
  If you have only one more, try to use factor.
• Is the extended model an improvement? Get predictions for the test data set, and compare MSE.
• What does anova do?

Interaction Effects
• For three variables X, Y, Z an interaction effect is present if the association between X and Y is different for different levels of Z
• interaction present: groups have different slopes
• no interaction: same slopes

[Figure: two panels, fitted(model2) and fitted(model3), each plotted against Distance (1000 to 4000) with one line per DayOfWeek (1 to 7)]

Interactions in R
• formula coding of interactions: * or :
• * includes all lower dimensional interactions, too (preferable)
• : includes only the interaction term specified

Interactions
• How do Events affect arrival delays?
• Would you assume that the interaction of Events and DayOfWeek is significant?
• Again, assess the extended model(s) both in terms of anova and prediction gains.

Problematic residual plots
[Figure: two panels of resid(m) plotted against fitted(m)]

Your Turn
• Work on modeling arrival delays.
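
As one possible starting point, the steps from these slides (split the data, fit, predict, compute the MSE, compare with anova) can be sketched end to end. This is only a sketch under stated assumptions: the data are simulated as a stand-in for the flights file, borrowing just the column names ArrDelay, Distance and DayOfWeek, and the effect sizes as well as the object names model.dow, model.int, ci.dist, test.mse and model.cmp are illustrative choices, not part of the course material.

```r
# Simulated stand-in for the flights data (ASSUMPTION: the real data come
# from flights-train.csv; only the column names are borrowed from it).
set.seed(987654)
n <- 1000
flights <- data.frame(
  Distance  = runif(n, 100, 4000),
  DayOfWeek = sample(1:7, n, replace = TRUE)
)
# Made-up effects, chosen only so that the models have something to find
flights$ArrDelay <- 5 + 0.002 * flights$Distance +
  2 * (flights$DayOfWeek == 5) + rnorm(n, sd = 15)

# Split: 90% training set, 10% test set
idx   <- sample(nrow(flights), floor(0.9 * nrow(flights)))
train <- flights[idx, ]
test  <- flights[-idx, ]

# Mean squared error, as defined on the slides
mse <- function(y, yhat) {
  sum((y - yhat)^2) / length(y)
}

# Three nested models: Distance only, plus DayOfWeek as a factor,
# plus the Distance-by-DayOfWeek interaction (via *)
model.dist <- lm(ArrDelay ~ Distance, data = train)
model.dow  <- lm(ArrDelay ~ Distance + factor(DayOfWeek), data = train)
model.int  <- lm(ArrDelay ~ Distance * factor(DayOfWeek), data = train)

# 90% confidence interval for the effect of Distance
ci.dist <- confint(model.dist, "Distance", level = 0.90)

# Test-set MSE for each model (lower is better)
test.mse <- sapply(
  list(dist = model.dist, dow = model.dow, int = model.int),
  function(m) mse(test$ArrDelay, predict(m, newdata = test))
)

# F-tests comparing the nested models on the training data
model.cmp <- anova(model.dist, model.dow, model.int)

print(ci.dist)
print(test.mse)
print(model.cmp)
```

With the real data, the simulated data frame would be replaced by the result of read.csv on the course file; the rest should carry over as long as the column names match. Treating DayOfWeek as a factor adds six coefficients rather than one, and the interaction model adds six more slopes on top of that.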