Topic 14: Inference in Multiple Regression

Outline
• Review multiple linear regression
• Inference of regression coefficients
  – Application to book example
• Inference of mean
  – Application to book example
• Inference of future observation
• Diagnostics and remedies

Data for Multiple Regression
• Yi is the response variable
• Xi1, Xi2, … , Xi,p-1 are the p-1 explanatory variables
• Yi, Xi1, Xi2, … , Xi,p-1 are the data for case i, where i = 1 to n

Multiple Regression Model
• Yi = β0 + β1Xi1 + β2Xi2 +…+ βp-1Xi,p-1 + ei
• Yi is the value of the response variable for the ith case
• β0 is the intercept
• β1, β2, … , βp-1 are the regression coefficients for the explanatory variables
• ei are independent Normally distributed random errors with mean 0 and variance σ2

Least Squares Solutions
  $b = (X'X)^{-1} X'Y$
  $s^2 = \mathrm{MSE} = Y'(I - H)Y / (n - p)$
  $s = \sqrt{\mathrm{MSE}}$ = Root MSE

ANOVA F-test
• H0: β1 = β2 = … = βp-1 = 0
• Ha: βk ≠ 0, for at least one k=1,2,…,p-1
• Under H0, F ~ F(p-1,n-p)
• Reject H0 if F is large; in terms of the P-value, reject if the P-value ≤ 0.05

Inference for individual regression coefficients
• We can show b ~ N(β, σ2(X΄X)-1)
• Define
  $s^2\{b\} = \mathrm{MSE}\,(X'X)^{-1}$  (a p × p matrix)
  $s^2\{b_k\} = [s^2\{b\}]_{k,k}$  (the diagonal element of s2{b} corresponding to bk)

Significance Test for βk
• H0: βk = 0
• Same test statistic t* = bk/s{bk}
• Still use dfE, which now equals n-p
• P-value computed from the t(n-p) distribution
• This tests the significance of a variable given the other variables are already in the model (i.e., fitted last)

Confidence interval for βk
• CI: bk ± tc s{bk}, where tc = t(.975, n-p)
• Same form as before but dfE now equals n-p
• This interval gives a range of plausible values for βk, given the other variables are in the model

Example II (KNNL p 236)
• Dwaine Studios, Inc.
operates portrait studios in 21 cities of medium size
• Yi is sales in city i
• X1 : population aged 16 and under
• X2 : per capita disposable income
  $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + e_i$

Read in the data
  data a1;
    infile '../data/ch06fi05.txt';
    input young income sales;
  proc print data=a1;
  run;

Partial Proc Print Results
  Obs   young   income   sales
    1    68.5     16.7   174.4
    2    45.2     16.8   164.4
    3    91.3     18.2   244.2
    4    47.8     16.3   154.6
    5    46.9     17.3   181.6

Proc Reg
  proc reg data=a1;
    model sales=young income;
  run;

Output
  Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              2            24015         12008     99.10   <.0001
  Error             18        2180.9274      121.1626
  Corrected Total   20            26196

  Root MSE   11.00739     R-Square   0.917

At least one variable is helpful in predicting sales

Output
  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1            -68.85707         60.01695     -1.15     0.2663
  young        1              1.45456          0.21178      6.87     <.0001
  income       1              9.36550          4.06396      2.30     0.0333

Both variables are helpful in explaining sales after the other is already in the model

CLB option
• Used to get confidence intervals for each coefficient
  proc reg data=a1;
    model sales=young income/clb;
  run;

Output
  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   95% Confidence Limits
  Intercept    1            -68.85707         60.01695   -194.94801    57.23387
  young        1              1.45456          0.21178      1.00962     1.89950
  income       1              9.36550          4.06396      0.82744    17.90356

What if only young is fit?
  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   95% Confidence Limits
  Intercept    1             68.04536          9.46224     48.24066    87.85006
  young        1              1.83588          0.14641      1.52943     2.14233

CIs for both the intercept and young change dramatically when young is the only explanatory variable

Estimation of E(Yh)
• Xh is now a vector that looks like (1, Xh1, Xh2, … , Xh,p-1)΄
• We want a point estimate and a confidence interval for the subpopulation mean corresponding to the set of explanatory variables Xh

Theory for E(Yh)
  $E(Y_h) = \mu_h = X_h'\beta$
  $\hat{\mu}_h = X_h' b$
  $s^2\{\hat{\mu}_h\} = X_h'\, s^2\{b\}\, X_h = s^2\, X_h'(X'X)^{-1} X_h$
  CI: $\hat{\mu}_h \pm s\{\hat{\mu}_h\}\, t(0.975,\, n-p)$

Using CLM option
  proc reg data=a1;
    model sales=young income/clm;
    id young income;
  run;
• The id statement adds young and income to the output table

CLM Output
  Output Statistics
  Obs   young   income   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Mean
    1    68.5     16.7             174.4000          187.1841                   3.8409   179.1146   195.2536
    2    45.2     16.8             164.4000          154.2294                   3.5558   146.7591   161.6998
    3    91.3     18.2             244.2000          234.3963                   4.5882   224.7569   244.0358
    4    47.8     16.3             154.6000          153.3285                   3.2331   146.5361   160.1210
    5    46.9     17.3             181.6000          161.3849                   4.4300   152.0778   170.6921
  ...
   21    52.3     16.0             166.5000          157.0644                   4.0792   148.4944   165.6344

Prediction of Yh
• Xh is still a vector of form (1, Xh1, Xh2, … , Xh,p-1)΄
• We want a prediction of Yh based on a set of predictor values with an interval that expresses the uncertainty in our prediction

Theory for Yh
  $Y_h = X_h'\beta + e_h$
  $\hat{Y}_h = \hat{\mu}_h = X_h' b$
  $\mathrm{Var}(Y_h - \hat{Y}_h) = \mathrm{Var}(\hat{Y}_h) + \mathrm{Var}(e_h)$
  $s^2\{\mathrm{pred}\} = s^2\,(1 + X_h'(X'X)^{-1} X_h)$
  Prediction interval: $\hat{\mu}_h \pm s\{\mathrm{pred}\}\, t(0.975,\, n-p)$

Using the CLI option
  proc reg data=a1;
    model sales=young income/cli;
    id young income;
  run;
• The id statement adds young and income to the output table

CLI Output
  Output Statistics
  Obs   young   income   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Predict
    1    68.5     16.7             174.4000          187.1841                   3.8409   162.6910   211.6772
    2    45.2     16.8             164.4000          154.2294                   3.5558   129.9271   178.5317
    3    91.3     18.2             244.2000          234.3963                   4.5882   209.3421   259.4506
  ...
   21    52.3     16.0             166.5000          157.0644                   4.0792   132.4018   181.7270

Diagnostics
• Look at the distribution of
each variable
• Look at the relationship between pairs of variables
• Plot the residuals versus
  – the predicted/fitted values
  – each explanatory variable
  – time (if available)

Diagnostics
• Are the residuals approximately Normal?
  – Look at a histogram
  – Normal quantile plot
• Is the variance constant?
  – Plot the residuals vs anything that might be related to the variance (e.g. residuals vs predicted values & residuals versus each X)

Remedies
• Similar remedies as simple regression
• Transformations such as Box-Cox
• Analyze with/without outliers
• More detail in KNNL Ch 9 and 10

Background Reading
• We finished Chapter 6.
• The program used to generate the output for confidence intervals for means and prediction intervals is topic14.sas
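
Sketch: residual diagnostics in SAS
The checks listed on the Diagnostics slides could be run roughly as follows for the Dwaine Studios fit. This is a minimal sketch, not taken from topic14.sas: it assumes the data set a1 from the read-in step already exists, and the output data set name a2 and the variable names resid and pred are hypothetical names chosen here for illustration.

```sas
* Sketch only: a2, resid, and pred are hypothetical names, not from the slides;
* Refit the model and save the residuals and predicted values;
proc reg data=a1;
  model sales=young income;
  output out=a2 r=resid p=pred;
run;

* Residuals versus the predicted values;
proc sgplot data=a2;
  scatter x=pred y=resid;
  refline 0 / axis=y;
run;

* Residuals versus each explanatory variable;
proc sgplot data=a2;
  scatter x=young y=resid;
  refline 0 / axis=y;
run;
proc sgplot data=a2;
  scatter x=income y=resid;
  refline 0 / axis=y;
run;

* Histogram and Normal quantile plot of the residuals;
proc univariate data=a2 noprint;
  var resid;
  histogram resid / normal;
  qqplot resid / normal(mu=est sigma=est);
run;
```

A roughly horizontal band around the zero reference line in the scatter plots is consistent with constant variance, and points close to the reference line in the quantile plot are consistent with Normal errors.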