UNC-Wilmington Department of Economics and Finance ECN 377 Dr. Chris Dumas

Regression Analysis -- Prediction/Forecasting

Forecasting Y Using the Sample Regression Equation

After we conduct a regression analysis, we often want to use our regression equation to predict/forecast the value of the dependent variable (Y) for various values of the independent variables (the X’s). For example, suppose we want to investigate the relationship between dependent variable Y and a single independent variable X1. The true relationship between Y and X1 for the individuals in the population under study is:

$$Y = \beta_0 + \beta_1 \cdot X_1 + e$$

where e is a normally-distributed, random error term. Suppose further that we don’t have data on all individuals in the population; instead, we just have data on a sample of individuals. Using regression analysis, we find the following estimate of the relationship between Y and X1, based on our sample data:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_1$$

The “hat” symbol is placed above variable Y to indicate that it is a forecast/predicted value, $\hat{Y}$. After we have used the OLS Estimator Equations to estimate $\hat{\beta}_0$ and $\hat{\beta}_1$, we can insert them into the sample regression equation and forecast/predict $\hat{Y}$ for a given value of X1. For example, suppose that $\hat{\beta}_0 = 10$, $\hat{\beta}_1 = 2$, and suppose that we want to forecast/predict Y when X1 = 40. Then:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_1$$
$$\hat{Y} = 10 + (2 \cdot 40)$$
$$\hat{Y} = 90$$

($\hat{Y} = 90$ is our forecast/prediction of Y when X1 = 40)

However, because there is error in our estimates of $\hat{\beta}_0$ and $\hat{\beta}_1$, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are used to forecast $\hat{Y}$, there will be error in our forecast/prediction of $\hat{Y}$. We measure the potential error in our forecast of $\hat{Y}$ by estimating the variance, standard error (s.e.), and confidence interval for the forecast $\hat{Y}$, as shown below:

Variance of the Forecast $\hat{Y}$ (for a given value $X_i$)

$$\mathrm{var}(\hat{Y}_i) = \sigma_e^2 \cdot \left[ 1 + \frac{1}{n} + \frac{(X_1 - \bar{X}_1)^2}{\sum_i (X_{1i} - \bar{X}_1)^2} \right]$$

Note: Because e is unknown, $\sigma_e^2$ is also unknown. So, we estimate $\sigma_e^2$ based on our sample data.
First, we calculate the estimated errors, or “residuals,” $\hat{e}_i = \hat{Y}_i - Y_i$, for each individual in the sample. Then we calculate the variance of the $\hat{e}_i$’s, denoted $\hat{\sigma}_e^2$. Finally, we substitute $\hat{\sigma}_e^2$ for $\sigma_e^2$ in the equation above to find:

$$\mathrm{var}(\hat{Y}_i) = \hat{\sigma}_e^2 \cdot \left[ 1 + \frac{1}{n} + \frac{(X_1 - \bar{X}_1)^2}{\sum_i (X_{1i} - \bar{X}_1)^2} \right]$$

Standard Error of the Forecast $\hat{Y}$

$$\mathrm{s.e.}(\hat{Y}) = \sqrt{\mathrm{var}(\hat{Y})}$$

Note: Don’t confuse SER with s.e.($\hat{Y}$). SER is the variation in Y around the regression line on average for all the X’s, whereas s.e.($\hat{Y}$) is the variation in Y around the regression line for the particular X for which we are forecasting.

Confidence Interval for the Forecast $\hat{Y}$

$$\text{Confidence Interval for } \hat{Y} = \hat{Y} \pm \left( t_{critical,\,\alpha/2} \cdot \mathrm{s.e.}(\hat{Y}) \right)$$

The values for $t_{critical}$ are found from the t-table using α/2 (because a Confidence Interval is two-sided) and d.f. = n - k, where n = sample size, and k = the number of β’s in the regression equation.

The graph below shows the Confidence Intervals for $\hat{Y}$ for all values of X1. Notice that the Confidence Interval is narrowest (that is, the forecasts have the least error) at the mean value of X1 (that is, near $\bar{X}_1$). For values of X1 larger or smaller than $\bar{X}_1$, the Confidence Intervals grow wider, indicating more error in the forecasts of Y.

[Figure: Confidence Interval for the Forecast $\hat{Y}$. Y is on the vertical axis and X1 on the horizontal axis, with $\bar{X}_1$ marked. The regression line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_1$ lies between the Upper Confidence Interval, $\hat{Y} + (t_{critical,\,\alpha/2} \cdot \mathrm{s.e.}(\hat{Y}))$, and the Lower Confidence Interval, $\hat{Y} - (t_{critical,\,\alpha/2} \cdot \mathrm{s.e.}(\hat{Y}))$.]

Prediction and Confidence Intervals in SAS

In SAS, PROC REG can be used to calculate the predicted values, $\hat{Y}$, and the confidence intervals for the regression line/curve. Within PROC REG, the “output” statement is used to create a new dataset, and the predicted values and confidence interval values are placed in the new dataset.
In the PROC REG statement below, the model command produces a regression of variable Y on variable X1, and then an “output” command is used to create a new dataset, dataset03. When dataset03 is created, the dataset on which the regression was based (dataset02 in the example below) is automatically copied into dataset03. In addition, “p=yhat” creates a new variable called “yhat” and sets it equal to the predicted values, the $\hat{Y}$’s, from the regression. This new yhat variable is added to dataset03. Commands “lclm=lower_ci” and “uclm=upper_ci” create new variables “lower_ci” and “upper_ci” and add them to dataset03. These variables contain the lower and upper confidence limits for each point along the regression line.

    proc reg data=dataset02;
      model Y = X1;
      output out=dataset03 p=yhat lclm=lower_ci uclm=upper_ci;
    run;

Note: lclm= and uclm= request confidence limits for the mean (expected) value of Y at each X1. To request the wider limits for an individual forecast, which correspond to the forecast variance formula given earlier, use lcl= and ucl= instead. By default, PROC REG computes these limits at the 95% level.

Now we can use the newly-created dataset03 together with PROC GPLOT to make a graph of our data points, the regression line/curve, and the 95% Confidence Intervals. The SAS commands are shown below, along with an illustration of the resulting graph.

    proc gplot data=dataset03;
      plot Y*X1 yhat*X1 upper_ci*X1 lower_ci*X1 / overlay;
    run;

[Figure: Data Points, Regression Line, and Confidence Intervals. Y is on the vertical axis and X1 on the horizontal axis. The Regression Line runs through the data points, between the Upper Confidence Interval and Lower Confidence Interval curves.]
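
The forecast formulas from the first part of this handout can also be worked through by hand, outside SAS. The following is a minimal sketch in Python (not part of the SAS workflow above), using a small made-up sample and a hypothetical helper named forecast_interval. It assumes that the estimated error variance is the sum of squared residuals divided by n - k, and it takes the critical t value as an input looked up in the t-table rather than computing it.

```python
import math


def forecast_interval(x, y, x_new, t_crit):
    """Forecast Y at x_new from a one-variable OLS fit, with its interval.

    t_crit is the two-sided critical t value from a t-table for
    d.f. = n - k, where k = 2 here (beta0 and beta1).
    Returns (y_hat, se, lower, upper).
    """
    n = len(x)
    k = 2
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # OLS estimates for a single independent variable
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residuals, using the handout's sign convention: e_i = Yhat_i - Y_i
    resid = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]

    # Assumption: sigma_e^2 is estimated as SSE / (n - k)
    sigma2_hat = sum(e ** 2 for e in resid) / (n - k)

    # Forecast, forecast variance, standard error, and interval
    y_hat = b0 + b1 * x_new
    var_forecast = sigma2_hat * (1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    se = math.sqrt(var_forecast)
    return y_hat, se, y_hat - t_crit * se, y_hat + t_crit * se


# Hypothetical sample, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
T_CRIT = 3.182  # t-table value for alpha/2 = 0.025 and d.f. = 5 - 2 = 3

print(forecast_interval(x, y, 3, T_CRIT))  # forecast at the sample mean of X1
print(forecast_interval(x, y, 6, T_CRIT))  # forecast away from the mean
```

With this made-up sample, the interval at X1 = 6 comes out wider than the interval at X1 = 3 (the sample mean of X1), which is the same pattern described for the confidence interval graph earlier in the handout.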