STATS 330: Lecture 7 (3/22/2016)

Prediction

Aims of today's lecture
- Describe how to use the regression model to predict future values of the response, given the values of the covariates.
- Describe how to estimate the mean response for particular values of the covariates.

Typical questions
- Given the height and diameter, can we predict the volume of a cherry tree? E.g. if a particular tree has diameter 11 inches and height 85 feet, can we predict its volume?
- What is the average volume of all cherry trees having a given height and diameter? E.g. if we consider all cherry trees with diameter 11 inches and height 85 feet, can we estimate their mean volume?

The regression model (revisited)
[Figure: the true regression plane.] The diagram shows the true plane, which describes the mean response for any x1, x2. The short "sticks" represent the "errors". The y-observation is the sum of the mean and the error: Y = μ + e.

Prediction: how to do it
Suppose we want to predict the response of an individual whose covariates are x1, ..., xk. For the cherry tree example, k = 2, and x1 (height) is 85, x2 (diameter) is 11. Since the response we want to predict is Y = μ + e, we predict Y by making separate predictions of μ and e and adding the results together.

Prediction: how to do it (2)
Since μ is the value of the true regression plane at x1, ..., xk, we predict μ by the value of the fitted plane at x1, ..., xk. This is β̂₀ + β̂₁x₁ + ... + β̂ₖxₖ. We have no information about e other than the fact that it is sampled from a normal distribution with zero mean, so we predict e by zero.

Prediction: how to do it (3)
Thus the value of the predictor is β̂₀ + β̂₁x₁ + ... + β̂ₖxₖ + 0 = β̂₀ + β̂₁x₁ + ... + β̂ₖxₖ. That is, to predict the response y when the covariates are x1, ..., xk, we use the value of the fitted plane at x1, ..., xk as the predictor.
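The arithmetic of the predictor can be sketched in a few lines (Python here, for illustration; the coefficient values below are made up, since the fitted cherry-tree coefficients are not shown on these slides):

```python
# Predictor = value of the fitted plane at the new covariates,
# plus zero for the error term.
# The beta-hat values are hypothetical, for illustration only; in
# practice they are the least-squares estimates from the fitted model.
b0_hat, b1_hat, b2_hat = -60.0, 0.3, 5.0
height, diameter = 85, 11          # the new tree's covariates (x1, x2)

# mean part: fitted plane at (x1, x2); error part: predicted by 0
y_pred = b0_hat + b1_hat * height + b2_hat * diameter + 0
```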
Inner product form
The predictor can be written as an inner product: with β̂ = (β̂₀, β̂₁, ..., β̂ₖ)′ the vector of regression coefficients and x = (1, x₁, ..., xₖ)′ the vector of predictor variables,

    x′β̂ = β̂₀ + β̂₁x₁ + ... + β̂ₖxₖ.

Prediction error
- Prediction error is measured by the expected value of (actual − predictor)², which equals Var(predictor) + σ².
- The first part comes from predicting the mean, the second from predicting the error.
- Var(predictor) depends on the x's and on σ²; see the formula on the "Var(predictor)" slide.
- The standard error of prediction is the square root of Var(predictor) + σ².
- Prediction intervals are approximately predictor ± 2 standard errors, or more precisely...

Prediction interval
A prediction interval is an interval which contains the actual value of the response with a given probability, e.g. 0.95. The 95% interval at x is

    x′β̂ ± t_{n-k-1}(0.975) × (standard error of prediction).

Var(predictor)
Arrange the covariate data (used for fitting the model) into a matrix:

        1  x11 ... x1k
    X = .   .       .
        1  xn1 ... xnk

Then Var(predictor) = σ² x′(X′X)⁻¹x.

[Figure: density of the t-distribution with 28 df, with area 0.975 to the left of qt(0.975, 28) and area 0.025 to the right.]

Doing it in R
Suppose we want to predict the volume of two cherry trees. The first has diameter 11 inches and height 85 feet; the second has diameter 12 inches and height 90 feet. The first step is to make a data frame containing the new data. Names must be the same as in the original data.
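The variance formula can be checked numerically. This is a Python/numpy sketch with made-up covariate data and a hypothetical σ² (the slides themselves use R):

```python
import numpy as np

# Made-up covariate data: n = 5 rows, k = 2 covariates,
# arranged with a leading column of ones as on the slide.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 3.0, 5.0],
              [1.0, 4.0, 4.0],
              [1.0, 5.0, 7.0],
              [1.0, 6.0, 6.0]])
s2 = 1.5                            # hypothetical estimate of sigma^2
x_new = np.array([1.0, 4.5, 5.0])   # (1, x1, ..., xk) for the new point

# Var(predictor) = sigma^2 * x'(X'X)^{-1} x
var_predictor = s2 * x_new @ np.linalg.inv(X.T @ X) @ x_new
# Standard error of prediction = sqrt(Var(predictor) + sigma^2)
se_prediction = np.sqrt(var_predictor + s2)
```

Note that the standard error of prediction is always larger than √Var(predictor), because the unpredictable error e contributes an extra σ².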
Then we need to combine the results of the fit (regression coefficients, estimate of the error variance) with the new data (the predictor variables) to calculate the predictor.

Doing it in R (2)

    # calculate the fit from the original data
    cherry.lm = lm(volume ~ diameter + height, data = cherry.df)
    # make a data frame with the new data (predictor variables)
    new.df = data.frame(diameter = c(11, 12), height = c(85, 90))
    # do the prediction
    predict(cherry.lm, new.df)
           1        2
    22.63846 29.04288

This only gives the predictions. What about prediction errors, prediction intervals etc.?

Doing it in R (3)

    predict(cherry.lm, new.df, se.fit = T, interval = "prediction")
    $fit
           fit      lwr      upr
    1 22.63846 13.94717 31.32976
    2 29.04288 19.97235 38.11340

    $se.fit
           1        2
    1.712901 2.130571

    $df
    [1] 28

    $residual.scale
    [1] 3.881832

The lwr and upr columns give the 95% prediction intervals. Note that $se.fit is the square root of Var(predictor), not the standard error of prediction; $df is the residual degrees of freedom; and $residual.scale is the estimate of σ (the square root of the residual mean square).

Hand calculation
The prediction is 22.63846, and

    SE(prediction) = sqrt(se.fit² + residual.scale²).

The prediction interval is

    prediction ± SE(prediction) × qt(0.975, n−k−1).

Hence SE(prediction) = sqrt(1.712901² + 3.881832²) = 4.242953 and qt(0.975, n−k−1) = qt(0.975, 28) = 2.048407, so the PI is 22.63846 ± 4.242953 × 2.048407, or (13.94717, 31.32976).

Estimating the mean response
Suppose we want to estimate the mean response of all individuals whose covariates are x1, ..., xk. Since the mean we want to estimate is the height of the true regression plane at x1, ..., xk, we use as our estimate the height of the fitted plane at x1, ..., xk. This is β̂₀ + β̂₁x₁ + ... + β̂ₖxₖ.

Standard error of the estimate
The standard error of the estimate is the square root of Var(predictor). Note that this is less than the standard error of the prediction:

    prediction SE = sqrt(Var(predictor) + σ²)
    estimate SE  = sqrt(Var(predictor))

Confidence intervals: predictor ± (standard error of estimate) × qt(0.975, n−k−1).

Doing it in R
Suppose we want to estimate the mean volume of all cherry trees having diameter 11 inches and height 85 feet. As before, the first step is to make a data frame containing the new data. Names must be the same as in the original data.

Doing it in R (2)

    > new.df <- data.frame(diameter = 11, height = 85)
    > predict(cherry.lm, new.df, se.fit = T, interval = "confidence")
    $fit
           fit      lwr      upr
    [1,] 22.63846 19.12974 26.14718

    $se.fit
    [1] 1.712901

    $df
    [1] 28

    $residual.scale
    [1] 3.881832

The lwr and upr columns now give the confidence interval for the mean response; note that it is shorter than the corresponding prediction interval.

Example: Hydrocarbon data
When petrol is pumped into a tank, hydrocarbon vapours are forced into the atmosphere. To reduce this significant source of air pollution, devices are installed to capture the vapour. A laboratory experiment was conducted in which the amount of vapour given off was measured under carefully controlled conditions.
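The hand calculations for the cherry-tree prediction interval, and the analogous calculation for the confidence interval, can be verified with a few lines of arithmetic (Python here; the input numbers are those returned by R's predict()):

```python
import math

# Values reported by predict() for the cherry-tree example
fit = 22.63846             # predicted volume
se_fit = 1.712901          # sqrt of Var(predictor)
residual_scale = 3.881832  # estimate of sigma (residual standard error)
t_crit = 2.048407          # qt(0.975, 28)

# Prediction interval uses SE(prediction) = sqrt(se.fit^2 + sigma^2)
se_pred = math.sqrt(se_fit**2 + residual_scale**2)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)

# Confidence interval for the mean response uses se.fit alone
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
```

Both intervals are centred at the same fitted value; the prediction interval is wider because it also accounts for the error term's variance σ².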
In addition to the response, there were four variables which were thought relevant for prediction:
- t.temp: initial tank temperature (degrees F)
- p.temp: temperature of the dispensed petrol (degrees F)
- t.vp: initial vapour pressure in tank (psi)
- p.vp: vapour pressure of the dispensed petrol (psi)
- hc: emitted dispensed hydrocarbons (g) (the response)

The data

    > petrol.df
      t.temp p.temp t.vp p.vp hc
    1     28     33 3.00 3.49 22
    2     24     48 2.78 3.22 27
    3     33     53 3.32 3.42 29
    4     29     52 3.00 3.28 29
    5     33     44 3.28 3.58 27
    6     31     36 3.10 3.26 24
    7     32     34 3.16 3.16 23
    8     34     35 3.22 3.22 23
    9     33     51 3.18 3.18 26
    ...

125 observations in all.

Pairs plot
[Figure: pairs plot of t.temp, p.temp, t.vp, p.vp and hc.]

Preliminary conclusions
- All variables seem related to the response.
- p.vp and t.vp seem highly correlated.
- There are quite strong correlations between some of the other variables.
- No obvious outliers.

Fit the "full" model

    > petrol.reg <- lm(hc ~ t.temp + p.temp + t.vp + p.vp, data = petrol.df)
    > summary(petrol.reg)

    Call:
    lm(formula = hc ~ t.temp + p.temp + t.vp + p.vp, data = petrol.df)

    Residuals:
         Min       1Q   Median       3Q      Max
    -6.46213 -1.31388  0.03005  1.23138  6.99663

Fit the "full" model (2)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.16609    1.02198   0.163  0.87117
    t.temp      -0.07764    0.04801  -1.617  0.10850
    p.temp       0.18317    0.04063   4.508 1.53e-05 ***
    t.vp        -4.45230    1.56614  -2.843  0.00526 **
    p.vp        10.27271    1.60882   6.385 3.37e-09 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.723 on 120 degrees of freedom
    Multiple R-Squared: 0.8959, Adjusted R-squared: 0.8925
    F-statistic: 258.2 on 4 and 120 DF, p-value: < 2.2e-16

Conclusions
- Large R².
- Significant coefficients, except for t.temp.
- The model seems satisfactory, so we can move on to prediction.
- Let's predict hydrocarbon emissions when t.temp = 28, p.temp = 30, t.vp = 3 and p.vp = 3.5.

Prediction

    > pred.df <- data.frame(t.temp = 28, p.temp = 30, t.vp = 3, p.vp = 3.5)
    > predict(petrol.reg, pred.df, interval = "p")
           fit      lwr      upr
    [1,] 26.08471 20.23045 31.93898

Hydrocarbon emission is predicted to be between 20.2 and 31.9 grams.
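The whole pipeline (fit, predict at a new point, form a prediction interval) can be sketched end to end. This is a Python/numpy illustration on made-up data standing in for petrol.df, using the "predictor ± 2 standard errors" approximation from the prediction-error slide rather than the exact t quantile:

```python
import numpy as np

# Made-up data: n = 12 observations, k = 2 covariates (not petrol.df)
rng = np.random.default_rng(0)
n, k = 12, 2
X = np.column_stack([np.ones(n),
                     rng.uniform(20, 40, n),   # a temperature-like covariate
                     rng.uniform(3, 7, n)])    # a pressure-like covariate
beta_true = np.array([1.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(0, 1.0, n)

# Least-squares fit and the residual variance estimate s^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k - 1)

# Predict at a new point and form an approximate 95% prediction interval
x_new = np.array([1.0, 28.0, 3.5])             # (1, x1, x2)
fit = x_new @ beta_hat
var_pred = s2 * x_new @ np.linalg.inv(X.T @ X) @ x_new
se_prediction = np.sqrt(var_pred + s2)
lwr, upr = fit - 2 * se_prediction, fit + 2 * se_prediction
```

For an exact interval, replace the factor 2 with qt(0.975, n−k−1), as in the hand calculation for the cherry trees.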