Regression Analysis--Prediction/Forecasting

UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
Forecasting Y Using the Sample Regression Equation
After we conduct a regression analysis, we often want to use our regression equation to predict/forecast the
value of the dependent variable (Y) for various values of the independent variables (the X's). For example,
suppose we want to investigate the relationship between dependent variable Y and a single independent variable
X1. The true relationship between Y and X1 for the individuals in the population under study is:
$$Y = \beta_0 + \beta_1 \cdot X_1 + e$$
where e is a normally-distributed, random error term. Suppose further that we don't have data on all individuals
in the population; instead, we just have data on a sample of individuals. Using regression analysis, we find the
following estimate of the relationship between Y and X1, based on our sample data:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_1$$
The "hat" symbol is placed above variable Y to indicate that it is a forecast/predicted value, $\hat{Y}$. After we have
used the OLS Estimator Equations to estimate $\hat{\beta}_0$ and $\hat{\beta}_1$, we can insert them into the sample regression equation
and forecast/predict $\hat{Y}$ for a given value of X1.
For example, suppose that $\hat{\beta}_0 = 10$, $\hat{\beta}_1 = 2$, and suppose that we want to forecast/predict Y when X1 = 40. Then:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_1$$
$$\hat{Y} = 10 + (2 \cdot 40)$$
$$\hat{Y} = 90$$
($\hat{Y} = 90$ is our forecast/prediction of Y when X1 = 40)
However, because there is error in our estimates of $\hat{\beta}_0$ and $\hat{\beta}_1$, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are used to forecast $\hat{Y}$, there will be
error in our forecast/prediction of $\hat{Y}$. We measure the potential error in our forecast of $\hat{Y}$ by estimating the
variance, standard error (s.e.), and confidence interval for the forecast $\hat{Y}$, as shown below:
Variance of the Forecast $\hat{Y}$ (for a given value $X_1$)
$$\text{var}(\hat{Y}_i) = \sigma_e^2 \cdot \left[ 1 + \frac{1}{n} + \frac{(X_1 - \bar{X}_1)^2}{\sum_i (X_{1i} - \bar{X}_1)^2} \right]$$
Note: Because e is unknown, $\sigma_e^2$ is also unknown. So, we estimate $\sigma_e^2$ based on
our sample data. First, we calculate the estimated errors, or "residuals,"
$\hat{e}_i = Y_i - \hat{Y}_i$ for each individual in the sample, then we calculate the variance of
the $\hat{e}_i$'s, denoted $\hat{\sigma}_e^2$, and then we substitute $\hat{\sigma}_e^2$ for $\sigma_e^2$ in the equation above to
find:
$$\text{var}(\hat{Y}_i) = \hat{\sigma}_e^2 \cdot \left[ 1 + \frac{1}{n} + \frac{(X_1 - \bar{X}_1)^2}{\sum_i (X_{1i} - \bar{X}_1)^2} \right]$$
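As a rough numerical illustration of the estimated-variance formula, suppose (purely hypothetical values, not taken from any dataset in this handout) that n = 25, $\hat{\sigma}_e^2 = 16$, $\bar{X}_1 = 30$, and $\sum_i (X_{1i} - \bar{X}_1)^2 = 1000$, and that we forecast at X1 = 40 as in the earlier example. Then:

$$\text{var}(\hat{Y}) = 16 \cdot \left[ 1 + \frac{1}{25} + \frac{(40 - 30)^2}{1000} \right] = 16 \cdot [1 + 0.04 + 0.10] = 16 \cdot 1.14 = 18.24$$

Under these assumed numbers, the leading 1 (the variance of the error term e itself) accounts for 16 of the 18.24; the 1/n term and the last term reflect the error in estimating $\hat{\beta}_0$ and $\hat{\beta}_1$.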
Standard Error of the Forecast $\hat{Y}$
$$\text{s.e.}(\hat{Y}) = \sqrt{\text{var}(\hat{Y})}$$
Note: Don't confuse SER with s.e.($\hat{Y}$). SER is the variation in Y around the regression
line on average for all the X's, whereas s.e.($\hat{Y}$) is the variation in Y around the regression line for
the particular X for which we are forecasting.
Confidence Interval for the Forecast $\hat{Y}$
Confidence Interval for $\hat{Y}$ = $\hat{Y} \pm (t_{critical,\alpha/2} \cdot \text{s.e.}(\hat{Y}))$
The values for $t_{critical}$ are found from the t-table using α/2 (because a Confidence Interval is two-sided) and
d.f. = n - k, where n = sample size, and k = the number of β's in the regression equation.
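The interval formula above can be sketched as a short SAS DATA step. This is only an illustration under stated assumptions: the forecast of 90 comes from the worked example in the text, while the variance of 18.24, the sample size of 25, and all variable names are hypothetical. The critical value is taken from the TINV function rather than a printed t-table.

```sas
/* Hypothetical inputs: only yhat = 90 comes from the worked example in the text */
data _null_;
   yhat     = 90;      /* forecast of Y at X1 = 40 */
   var_yhat = 18.24;   /* assumed estimated variance of the forecast */
   n        = 25;      /* assumed sample size */
   k        = 2;       /* number of betas: intercept and slope */
   alpha    = 0.05;    /* 95% confidence level */

   se    = sqrt(var_yhat);             /* s.e. of the forecast */
   tcrit = tinv(1 - alpha/2, n - k);   /* two-sided critical t value, d.f. = n - k */
   lower = yhat - tcrit*se;            /* lower confidence limit */
   upper = yhat + tcrit*se;            /* upper confidence limit */

   put se= tcrit= lower= upper=;       /* write the results to the SAS log */
run;
```

With these assumed inputs, the critical value for 23 d.f. is about 2.069, so the interval should come out to roughly 90 ± 8.8.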
The graph below shows the Confidence Intervals for $\hat{Y}$ for all values of X1. Notice that the Confidence Interval is
narrowest (that is, the forecasts have less error) for the mean value of X1 (that is, near $\bar{X}_1$). For values of X1
larger or smaller than $\bar{X}_1$, the Confidence Intervals grow much wider, indicating more error in the
forecasts of Y.
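[Figure: Confidence Interval for the Forecast $\hat{Y}$. Vertical axis: Y; horizontal axis: X1. The regression line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_1$ lies between the Upper Confidence Interval curve, $\hat{Y} + (t_{critical,\alpha/2} \cdot \text{s.e.}(\hat{Y}))$, and the Lower Confidence Interval curve, $\hat{Y} - (t_{critical,\alpha/2} \cdot \text{s.e.}(\hat{Y}))$; $\bar{X}_1$ is marked on the horizontal axis.]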
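The widening of the interval away from the mean can be checked numerically. As a sketch with purely hypothetical values (n = 25, $\hat{\sigma}_e^2 = 16$, $\bar{X}_1 = 30$, $\sum_i (X_{1i} - \bar{X}_1)^2 = 1000$), compare the estimated forecast variance at the mean, X1 = 30, with the variance far from the mean, X1 = 60:

$$\text{var}(\hat{Y} \mid X_1 = 30) = 16 \cdot \left[ 1 + \frac{1}{25} + \frac{(30 - 30)^2}{1000} \right] = 16 \cdot 1.04 = 16.64$$

$$\text{var}(\hat{Y} \mid X_1 = 60) = 16 \cdot \left[ 1 + \frac{1}{25} + \frac{(60 - 30)^2}{1000} \right] = 16 \cdot 1.94 = 31.04$$

Under these assumptions the standard error of the forecast rises from about 4.08 to about 5.57, so the confidence interval at X1 = 60 is noticeably wider than at the mean.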
Prediction and Confidence Intervals in SAS
In SAS, PROC REG can be used to calculate the predicted values, $\hat{Y}$, and the confidence intervals for the
regression line/curve. Within PROC REG, the "output" statement is used to create a new dataset, and the
predicted values and confidence interval values are placed in the new dataset. In the PROC REG statement
below, the model command produces a regression of variable Y on variable X1, and then an "output" command is
used to create a new dataset, dataset03. When dataset03 is created, the dataset on which the regression was based
(dataset02 in the example below) is automatically copied into dataset03. In addition, "p=yhat" creates a new
variable called "yhat" and sets it equal to the predicted values, the $\hat{Y}$'s, from the regression. This new yhat
variable is added to dataset03. Commands "lclm=lower_ci" and "uclm=upper_ci" create new variables
"lower_ci" and "upper_ci" and add them to dataset03. These variables contain the lower and upper
confidence interval numbers for each point along the regression line. Note that lclm= and uclm= give confidence
limits for the mean of Y at each X1; the wider limits for an individual forecast, which correspond to the
variance formula with the leading 1 inside the brackets, are requested with lcl= and ucl= instead.
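proc reg data=dataset02;
   /* Regress Y on X1 */
   model Y = X1;
   /* Save predicted values and 95% confidence limits (the default level) in dataset03 */
   output out=dataset03 p=yhat lclm=lower_ci uclm=upper_ci;
run;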
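To forecast Y at a specific X1 value that may not appear in the sample, one common approach is to append an observation with that X1 value and a missing Y; PROC REG leaves it out of the estimation but still computes its predicted value. The sketch below assumes the X1 = 40 value from the earlier worked example; the names new_obs, dataset02_plus, dataset04, se_forecast, lower_pi, and upper_pi are hypothetical. It requests lcl= and ucl= (limits for an individual forecast) and stdi= (the standard error of an individual forecast).

```sas
/* Hypothetical one-row dataset holding the X1 value to forecast at */
data new_obs;
   X1 = 40;
   Y  = .;     /* missing Y: excluded from estimation, but still scored */
run;

/* Stack the original sample and the new observation */
data dataset02_plus;
   set dataset02 new_obs;
run;

proc reg data=dataset02_plus;
   model Y = X1;
   /* p= forecast, stdi= s.e. of an individual forecast, lcl=/ucl= its 95% limits */
   output out=dataset04 p=yhat stdi=se_forecast lcl=lower_pi ucl=upper_pi;
run;
quit;

/* Show only the appended observation */
proc print data=dataset04;
   where Y = .;
run;
```

This writes to a separate dataset04, so dataset03 from the example above is left unchanged.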
Now we can use the newly-created dataset03 together with PROC GPLOT to make a graph of our data points, the
regression line/curve, and the 95% Confidence Intervals. The SAS commands are shown below, along with an
illustration of the resulting graph.
proc gplot data=dataset03;
   /* Overlay the data points, predicted values, and confidence limits on one set of axes */
   plot Y*X1 yhat*X1 upper_ci*X1 lower_ci*X1 / overlay;
run;
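By default the overlaid series may all be drawn as unconnected plotting symbols. One possible way to show the data as points and the regression line and confidence limits as lines is to sort by X1 and assign a SYMBOL definition to each plot request. The colors and line styles below are arbitrary choices for illustration, not settings from the original handout.

```sas
/* Sort so that joined lines are drawn from left to right */
proc sort data=dataset03;
   by X1;
run;

/* Assumed styling: points for the data, joined lines for the other series */
symbol1 value=dot  interpol=none color=black;        /* data points */
symbol2 value=none interpol=join color=blue;         /* regression line */
symbol3 value=none interpol=join color=red line=2;   /* upper confidence limit */
symbol4 value=none interpol=join color=red line=2;   /* lower confidence limit */

proc gplot data=dataset03;
   /* "=n" ties each plot request to the SYMBOLn definition above */
   plot Y*X1=1 yhat*X1=2 upper_ci*X1=3 lower_ci*X1=4 / overlay;
run;
quit;
```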
[Figure: Data Points, Regression Line, and Confidence Intervals. Vertical axis: Y; horizontal axis: X1. The plot shows the data points, the Regression Line, the Upper Confidence Interval, and the Lower Confidence Interval.]