Chapter 2 (read on your own)

Read Chapters 1-2 on your own, but note Section 2.5, a problem with significance testing:

Study one: $\hat\theta = 25$ with SE $= 10$; two-sided p-value $= .012$.
Study two: $\hat\theta = 10$ with SE $= 10$; p-value $= .30$.
Comparing study one to study two: $\hat\Delta = 15$ with SE $= \sqrt{10^2 + 10^2} = 14.1$, for a two-sided p-value of $.29$: the two studies do not differ significantly, even though one is "significant" and the other is not.

Also note Section 2.4: we don't worry about "multiple testing" issues.

3.1 SLR

Regress kids' test scores on mom's high school completion (binary: 1 or 0). Interpret the intercept estimate (78) and slope estimate (12).

xtable(lm(kid.score ~ mom.hs, kids))

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   77.5484      2.0586    37.67    0.0000
mom.hs        11.7713      2.3224     5.07    0.0000

Regress kids' test scores on mom's IQ (continuous). Interpret the intercept estimate (26) and slope estimate (0.6).

xtable(lm(kid.score ~ mom.iq, kids))

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   25.7998      5.9174     4.36    0.0000
mom.iq         0.6100      0.0585    10.42    0.0000

3.2 MLR

Using both predictors: $\hat\beta = (26,\ 6,\ 0.6)^T$.

xtable(lm(kid.score ~ mom.hs + mom.iq, kids))

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   25.7315      5.8752     4.38    0.0000
mom.hs         5.9501      2.2118     2.69    0.0074
mom.iq         0.5639      0.0606     9.31    0.0000

Interpretations: a slope is the "change when all other factors are held constant" (and is conditional on the other terms being in the model). Holding other predictors constant does not make sense in many settings (e.g. polynomials, interactions).

Prediction: if we look at another person or group whose IQ is one point higher but whose HS status is the same, how does the predicted score change? Or: what is the change in prediction for HS vs. non-HS at the same IQ?

Counterfactuals: imagine rewinding the clock and changing mom's HS from 0 to 1. What is the effect on the same kid's score? It's a thought experiment; it is easier to imagine when we're assigning treatments. More in Chapters 9-10.

3.3 Interactions

The effect of IQ depends on the level of HS. Plugging in 0's for HS gives the intercept and slope for one group; 1's give the adjustments to the intercept and slope for the other. What if HS were a factor with levels "noHS" and "HS"? How would R code them?

Note the book's advice about finding interactions: they tend to appear when main effects are large. Really??

It helps to center predictors about their mean.

3.4 Inference

Terminology: units (of analysis, not of measurement; these may be subjects), predictors, and the outcome (response) variable.

Matrix notation, distributions: $y \sim N(X\beta,\ \sigma^2 V)$ with, for now, $V = I$.

G & H have an R package called arm containing the data and functions like display.

OLS Estimator

$\hat\beta = (X^T X)^{-1} X^T y$, with variance $\mathrm{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1} = \sigma^2 V_\beta$.

If two predictors are orthogonal (not merely linearly independent), what will the covariance of their estimated coefficients be?

Centered X

xtable(summary(fit1))

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   25.7998      5.9174     4.36    0.0000
mom.iq         0.6100      0.0585    10.42    0.0000

xtable(summary(fit2))

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   86.7972      0.8768    98.99    0.0000
centeredIQ     0.6100      0.0585    10.42    0.0000

xtable(cbind(summary(fit1, cor=TRUE)$cor, summary(fit2, cor=TRUE)$cor))

Correlations of the coefficient estimates, before and after centering:

             (Intercept)  mom.iq   |              (Intercept)  centeredIQ
(Intercept)         1.00   -0.99   | (Intercept)         1.00        0.00
mom.iq             -0.99    1.00   | centeredIQ          0.00        1.00

Centering

Centering changes the intercept (and its SE and correlation) but not the fit:

xtable(anova(fit1, fit2))

   Res.Df        RSS  Df  Sum of Sq  F  Pr(>F)
1     432  144137.34
2     432  144137.34  -0       0.00

print(xtable(matrix(fivenum(predict(fit1) - predict(fit2)), nrow=1)),
      include.colnames=FALSE)

  -0.00  -0.00  -0.00  -0.00  0.00

Why the centered predictor is orthogonal to the column of ones:
$1^T(x - \bar x 1) = 1^T x - \bar x\, 1^T 1 = \sum x_i - n\bar x = 0$.
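The slides use fit1 and fit2 without showing how they were built. A minimal sketch of their presumable construction, assuming the kids data frame from 3.1 (the names fit1, fit2, and centeredIQ are taken from the output above):

fit1 <- lm(kid.score ~ mom.iq, kids)                  # raw IQ
kids$centeredIQ <- kids$mom.iq - mean(kids$mom.iq)    # center about the mean
fit2 <- lm(kid.score ~ centeredIQ, kids)              # centered IQ
rbind(raw = coef(fit1), centered = coef(fit2))        # same slope, new intercept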
Welcome to Bayesian Inference

Note that in Figure 3.7, $\hat\beta$ is known and is labeled on the x axis. Frequentists would consider the unknown $\beta$ the center of the sampling distribution of $\hat\beta$, i.e. $\hat\beta \sim N(\beta,\ \sigma^2 V_\beta)$. Our authors use a Bayesian interpretation, $\beta \sim N(\hat\beta,\ \sigma^2 V_\beta)$: our ignorance about the location of the true $\beta$ is expressed as a posterior distribution.

Variances

For any estimable linear combination $\lambda^T\beta$ of the coefficients, the variance of the estimate is
$\mathrm{Var}(\lambda^T \hat\beta) = \sigma^2 \lambda^T V_\beta \lambda$.
If $\lambda$ is a single column, what is the dimension of this? If $\Lambda$ has four columns (each making an estimable linear combination), what are the dimensions, and what are the components of $\mathrm{Var}(\Lambda^T \hat\beta)$?

Comparing Means

Compute pairwise comparisons of breakage by tension level.

coef(warp.fit <- lm(breaks ~ tension, warpbreaks))
## (Intercept)    tensionM    tensionH
##        36.4       -10.0       -14.7

Lambda <- matrix(c(0, 1,  0,
                   0, 0,  1,
                   0, 1, -1), byrow=TRUE, 3, 3)
as.numeric(Lambda %*% coef(warp.fit))
## [1] -10.00 -14.72   4.72
sqrt(diag(Lambda %*% vcov(warp.fit) %*% t(Lambda)))
## [1] 3.96 3.96 3.96

Another Generalized Inverse

Refitting without an intercept gives the cell-means parameterization:

coef(warp.fit2 <- lm(breaks ~ tension - 1, warpbreaks))
## tensionL tensionM tensionH
##     36.4     26.4     21.7

Lambda2 <- matrix(c(1, -1,  0,
                    1,  0, -1,
                    0,  1, -1), byrow=TRUE, 3, 3)
as.numeric(Lambda2 %*% coef(warp.fit2))
## [1] 10.00 14.72  4.72

Lambda2 %*% vcov(warp.fit2) %*% t(Lambda2)
##       [,1]  [,2]  [,3]
## [1,] 15.68  7.84 -7.84
## [2,]  7.84 15.68  7.84
## [3,] -7.84  7.84 15.68

Compare the Fits

The two parameterizations give the same estimable combinations. See how the two var-cov matrices compare:

xtable(vcov(warp.fit))

             (Intercept)  tensionM  tensionH
(Intercept)         7.84     -7.84     -7.84
tensionM           -7.84     15.68      7.84
tensionH           -7.84      7.84     15.68

xtable(vcov(warp.fit2))

           tensionL  tensionM  tensionH
tensionL       7.84     -0.00     -0.00
tensionM      -0.00      7.84     -0.00
tensionH      -0.00     -0.00      7.84

Residuals

$\hat r = y - X\hat\beta$, so $r_i = y_i - x_i\hat\beta$.
$s^2 = \hat\sigma^2 = \hat r^T \hat r/(n-k) = \sum r_i^2/(n-k)$, where $n$ is the number of rows and $k$ is the rank of $X$.
The sampling distribution of $(n-k)\,\hat\sigma^2/\sigma^2$ is $\chi^2_{n-k}$.
$R^2 = 1 - \hat\sigma^2/s_y^2$ is the proportion of the variance of $y$ explained by the model; for this interpretation the model must contain an intercept.

p. 42: Do the authors drop predictors with coefficients close to 0 (in SE's)? Stay tuned for §4.6.
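A quick sketch verifying the residual formulas on warp.fit from above. (The second comparison reflects that G & H's $R^2 = 1 - \hat\sigma^2/s_y^2$, with both variances df-corrected, is what R reports as the adjusted $R^2$.)

# Verify s^2 = sum(r_i^2)/(n - k) against what R reports.
r <- residuals(warp.fit)
n <- nrow(warpbreaks)
k <- length(coef(warp.fit))       # equals the rank of X for this full-rank design
s2 <- sum(r^2) / (n - k)
c(by.hand = s2, from.R = summary(warp.fit)$sigma^2)       # identical
# With df-corrected variances, 1 - s2/var(y) is R's *adjusted* R^2:
c(by.hand = 1 - s2 / var(warpbreaks$breaks),
  from.R  = summary(warp.fit)$adj.r.squared)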
Drawing Lines

kidfit.2 <- lm(kid.score ~ mom.iq, kids)
plot(kid.score ~ mom.iq, data=kids, xlab="Moms IQ", ylab="Kids Score",
     col=mom.hs + 2, pch=20)
curve(coef(kidfit.2)[1] + coef(kidfit.2)[2] * x, add=TRUE, col="blue")
fit.3 <- lm(kid.score ~ mom.hs + mom.iq, kids)
curve(cbind(1, 1, x) %*% coef(fit.3), add=TRUE, col="darkgreen", lwd=2)
curve(cbind(1, 0, x) %*% coef(fit.3), add=TRUE, col="red", lwd=2)

[Figure: kids' scores vs. moms' IQ, points colored by mom.hs, with the SLR line (blue) and the two parallel lines from the additive fit (dark green for HS, red for non-HS).]

Interaction Lines

kidfit.4 <- lm(kid.score ~ mom.hs * mom.iq, kids)
plot(kid.score ~ mom.iq, data=kids, xlab="Moms IQ", ylab="Kids Score",
     col=mom.hs + 2, pch=20)
curve(cbind(1, 1, x, 1*x) %*% coef(kidfit.4), add=TRUE, col="darkgreen", lwd=2)
curve(cbind(1, 0, x, 0*x) %*% coef(kidfit.4), add=TRUE, col="red", lwd=2)

[Figure: the same scatterplot with the two non-parallel lines from the interaction fit.]

Uncertainty

How wobbly is the fit of the line? For each row of data, $SE(\hat y_i) = s\sqrt{x_i V_\beta x_i^T}$.

How well do we predict new points? A new point has extra variance $\hat\sigma^2$, so the predictive error is $s\sqrt{1 + x_i V_\beta x_i^T}$.

Adding prediction bands to an existing ggplot (myplot, presumably built on an earlier slide):

predictDF <- data.frame(mom.iq=rep(7:14 * 10, 2), mom.hs=rep(0:1, each=8))
predictDF <- cbind(predictDF,
                   predict(kidfit.4, newdata=predictDF, interval="prediction"))
names(predictDF)[3] <- "kid.score"    # rename the "fit" column
predictDF$mom.hs <- factor(predictDF$mom.hs)
myplot + geom_smooth(aes(ymin = lwr, ymax = upr), data = predictDF,
                     stat = "identity")

[Figure: scatterplot with the interaction fit and shaded prediction bands for each mom.hs group.]
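A sketch of those two standard errors computed by hand for one illustrative new mom (HS grad with IQ 100; the point is arbitrary), cross-checked against R's own predict():

# SE of the fitted line and of a new prediction at mom.hs = 1, mom.iq = 100.
# Note vcov(kidfit.4) = s^2 * V_beta, so x0' vcov(fit) x0 = s^2 (x0' V_beta x0').
x0 <- c(1, 1, 100, 1 * 100)          # intercept, mom.hs, mom.iq, interaction
s  <- summary(kidfit.4)$sigma
se.fit  <- sqrt(drop(t(x0) %*% vcov(kidfit.4) %*% x0))   # = s * sqrt(x V_beta x')
se.pred <- sqrt(s^2 + se.fit^2)                          # adds the new point's variance
c(fit = se.fit, prediction = se.pred)
# Cross-check the first value:
predict(kidfit.4, data.frame(mom.hs = 1, mom.iq = 100), se.fit = TRUE)$se.fit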
The Usual Assumptions

Data are valid: they will answer our question.
Model is valid: properly specified.
Independent errors.
Constant variance: homo-, not hetero-scedastic.
Normality of errors. Never check the raw y for normality; check the residuals.

Diagnostic plots

par(mfrow=c(1, 4))
plot(kidfit.4)

[Figure: the four default lm diagnostic panels for kidfit.4 (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours); observations 111, 273, and 286 are flagged.]

Check validity with new data (beware extrapolation). Data used to create the estimated model always fit it better than data which got no "say" in the estimation. Later: "Can the model generate data like the data we see?"
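A sketch of that last point on the kids data: fit on a random half and compare in-sample error to holdout error (the seed, the split, and the name half.fit are all arbitrary choices for illustration):

# In-sample error flatters the model; holdout error is usually larger.
set.seed(505)                                     # arbitrary seed
train <- sample(nrow(kids), nrow(kids) %/% 2)
half.fit <- lm(kid.score ~ mom.hs * mom.iq, data = kids[train, ])
in.sample <- sqrt(mean(residuals(half.fit)^2))
holdout   <- sqrt(mean((kids$kid.score[-train] -
                        predict(half.fit, newdata = kids[-train, ]))^2))
c(in.sample = in.sample, holdout = holdout)       # holdout typically the larger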