Topic 12: Multiple Linear Regression

Outline
• Multiple Regression
  – Data and notation
  – Model
  – Inference
• Recall notes from Topic 3 for simple linear regression

Data for Multiple Regression
• Yi is the response variable
• Xi1, Xi2, …, Xi,p-1 are p-1 explanatory (or predictor) variables
• Cases are denoted by i = 1 to n

Multiple Regression Model
Yi = β0 + β1Xi1 + β2Xi2 + … + βp-1Xi,p-1 + εi
• Yi is the value of the response variable for the ith case
• β0 is the intercept
• β1, β2, …, βp-1 are the regression coefficients for the explanatory variables
• Xik is the value of the kth explanatory variable for the ith case
• εi are independent Normally distributed random errors with mean 0 and variance σ²

Multiple Regression Parameters
• β0 is the intercept
• β1, β2, …, βp-1 are the regression coefficients for the explanatory variables
• σ² is the variance of the error term

Interesting special cases
• Yi = β0 + β1Xi + β2Xi² + … + βp-1Xi^(p-1) + εi (a polynomial of order p-1)
• X's can be indicator or dummy variables taking the values 0 and 1 (or any other two distinct numbers)
• Interactions between explanatory variables, represented as products of explanatory variables

Interesting special cases
• Consider the model Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi
• If X2 is a dummy variable:
  – Yi = β0 + β1Xi1 + εi (when Xi2 = 0)
  – Yi = β0 + β1Xi1 + β2 + β3Xi1 + εi = (β0 + β2) + (β1 + β3)Xi1 + εi (when Xi2 = 1)
  – This models two different regression lines at the same time (see the numerical sketch after the inference slides below)

Model in Matrix Form
Y = Xβ + ε
where Y is n×1, X is n×p, β is p×1, and ε is n×1
ε ~ N(0, σ²I(n×n))
Y ~ N(Xβ, σ²I(n×n))

Least Squares
Find b to minimize SSE = (Y - Xb)′(Y - Xb)
This yields the normal equations X′Xb = X′Y

Least Squares Solution
b = (X′X)⁻¹X′Y
Fitted (predicted) values: Ŷ = Xb = X(X′X)⁻¹X′Y = HY
Residuals: e = Y - Ŷ = Y - HY = (I - H)Y
I - H is symmetric and idempotent: (I - H)(I - H) = I - H

Covariance Matrix of residuals
• Cov(e) = σ²(I - H)(I - H)′ = σ²(I - H)
• Var(ei) = σ²(1 - hii), where hii = X′i(X′X)⁻¹Xi and X′i = (1, Xi1, …, Xi,p-1)
• Residuals are usually correlated: Cov(ei, ej) = -σ²hij

Estimation of σ
s² = (Y - Xb)′(Y - Xb)/(n - p) = SSE/dfE = MSE
s = √MSE (Root MSE)

Distribution of b
• b = (X′X)⁻¹X′Y
• Since Y ~ N(Xβ, σ²I):
  – E(b) = ((X′X)⁻¹X′)Xβ = β
  – Cov(b) = σ²((X′X)⁻¹X′)((X′X)⁻¹X′)′ = σ²(X′X)⁻¹
• σ²(X′X)⁻¹ is estimated by s²(X′X)⁻¹

ANOVA Table
• Sources of variation are
  – Model (SAS) or Regression (KNNL)
  – Error (SAS, KNNL) or Residual
  – Total
• SS and df add as before
  – SSM + SSE = SSTO
  – dfM + dfE = dfTotal

Sums of Squares
SSM = Σi (Ŷi - Ȳ)²
SSE = Σi (Yi - Ŷi)²
SSTO = Σi (Yi - Ȳ)²
(all sums run over i = 1 to n)

Degrees of Freedom
dfM = p - 1
dfE = n - p
dfTotal = n - 1

Mean Squares
MSM = SSM/dfM = Σi (Ŷi - Ȳ)² / (p - 1)
MSE = SSE/dfE = Σi (Yi - Ŷi)² / (n - p)
MST = SSTO/dfTotal = Σi (Yi - Ȳ)² / (n - 1)

ANOVA Table
Source   SS     df        MS    F
Model    SSM    dfM       MSM   MSM/MSE
Error    SSE    dfE       MSE
Total    SSTO   dfTotal   MST

ANOVA F test
• H0: β1 = β2 = … = βp-1 = 0
• Ha: βk ≠ 0 for at least one k = 1, …, p-1
• Under H0, F = MSM/MSE ~ F(p-1, n-p)
• Reject H0 if F is large; use the P-value

P-value of F test
• The P-value for the F significance test tells us one of the following:
  – there is no evidence to conclude that any of our explanatory variables can help us model the response variable using this kind of model (P > .05)
  – one or more of the explanatory variables in our model is potentially useful for predicting the response variable in a linear model (P ≤ .05)
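Returning to the dummy-variable interaction model from the "Interesting special cases" slides, here is a minimal numerical sketch in Python with NumPy (the data values are made up purely for illustration): it fits Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 and reads off the two implied regression lines.

```python
import numpy as np

# Made-up data: X2 is a 0/1 dummy variable, X1*X2 is the interaction term
X1 = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
X2 = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
Y  = np.array([3.1, 5.0, 7.2, 8.9, 6.0, 9.1, 11.8, 15.2])

# Design matrix for Yi = b0 + b1*Xi1 + b2*Xi2 + b3*Xi1*Xi2
X = np.column_stack([np.ones_like(X1), X1, X2, X1 * X2])
b0, b1, b2, b3 = np.linalg.lstsq(X, Y, rcond=None)[0]

# When X2 = 0: intercept b0,      slope b1
# When X2 = 1: intercept b0 + b2, slope b1 + b3
print(f"X2=0 line: Y = {b0:.2f} + {b1:.2f} X1")
print(f"X2=1 line: Y = {b0 + b2:.2f} + {b1 + b3:.2f} X1")
```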
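The matrix formulas for b, H, the residuals, and s² translate directly into code. Below is a minimal sketch, again with NumPy and made-up data; it forms (X′X)⁻¹ explicitly to mirror the slides, though in practice one would solve the normal equations instead of inverting.

```python
import numpy as np

# Made-up data: n = 6 cases, p - 1 = 2 explanatory variables
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([4.1, 5.9, 9.8, 10.9, 15.2, 16.1])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])  # n x p design matrix; first column gives the intercept
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y            # b = (X'X)^{-1} X'Y

H = X @ XtX_inv @ X.T            # hat matrix H = X (X'X)^{-1} X'
Y_hat = H @ Y                    # fitted values HY (equivalently X @ b)
e = Y - Y_hat                    # residuals (I - H) Y

SSE = e @ e                      # (Y - Xb)'(Y - Xb)
s2 = SSE / (n - p)               # MSE, the estimate of sigma^2
leverages = np.diag(H)           # h_ii, so Var(e_i) = sigma^2 (1 - h_ii)
cov_b = s2 * XtX_inv             # estimated Cov(b) = s^2 (X'X)^{-1}
print(b, s2)
```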
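A companion sketch of the ANOVA decomposition and the overall F test, using the same made-up data; the P-value assumes SciPy's F distribution is available.

```python
import numpy as np
from scipy import stats

# Same made-up data as the previous sketch
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([4.1, 5.9, 9.8, 10.9, 15.2, 16.1])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])
p = X.shape[1]
Y_hat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]

# Sums of squares; note SSM + SSE = SSTO
SSM  = np.sum((Y_hat - Y.mean()) ** 2)
SSE  = np.sum((Y - Y_hat) ** 2)
SSTO = np.sum((Y - Y.mean()) ** 2)

# Mean squares and the overall F statistic
MSM = SSM / (p - 1)
MSE = SSE / (n - p)
F = MSM / MSE
p_value = stats.f.sf(F, p - 1, n - p)   # P(F(p-1, n-p) > observed F)
print(F, p_value)
```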
R²
• The squared multiple correlation R² gives the proportion of variation in the response variable explained by all the explanatory variables
• It is usually expressed as a percent
• It is sometimes called the coefficient of multiple determination (KNNL p 226)
• R² = SSM/SSTO: the proportion of variation explained
• R² = 1 - SSE/SSTO: one minus the proportion not explained
• The F test can be expressed in terms of R² (see the sketch at the end):
  F = [R²/(p-1)] / [(1 - R²)/(n-p)]

Background Reading
• We went over KNNL 6.1 - 6.5
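To close, a short sketch (same made-up data as the earlier sketches) that verifies numerically that F computed from the ANOVA table equals F computed from R²:

```python
import numpy as np

# Same made-up data as the earlier sketches
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([4.1, 5.9, 9.8, 10.9, 15.2, 16.1])

n = len(Y)
X = np.column_stack([np.ones(n), X1, X2])
p = X.shape[1]
Y_hat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]

SSM  = np.sum((Y_hat - Y.mean()) ** 2)
SSE  = np.sum((Y - Y_hat) ** 2)
SSTO = np.sum((Y - Y.mean()) ** 2)

R2 = SSM / SSTO                         # equivalently 1 - SSE/SSTO
F_anova = (SSM / (p - 1)) / (SSE / (n - p))
F_from_R2 = (R2 / (p - 1)) / ((1 - R2) / (n - p))
print(R2, F_anova, F_from_R2)           # the two F values agree
```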