Excel--OLS Regression Analysis

UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
Recall that Regression Analysis is a method for using data to find the best estimates of the possible relationships
among variables. Typically, we are interested in explaining, forecasting or predicting the behavior of one or more
variables, called the dependent, or “Y,” variables, based on the behavior of one or more independent, or “X,”
variables. This handout will explain how to conduct ordinary least-squares (OLS) regression analysis (both
simple and multiple regression analysis) in Microsoft Excel 2013.
NOTE: It is assumed in this handout that the Analysis Toolpak "Add-In" has been activated in Excel. An
"Add-In" is an extra feature of Excel that is not active by default, so you must activate it. (To activate the Data
Analysis "Add-In" in Excel, start Excel, and then go to the File tab at the top of the Excel window, then select
"Options" on the left, and then "Add-Ins" on the left. Next, at the bottom of the window, in the box to the right of
"Manage", select "Excel Add-ins", then click the "Go" button, then check the box beside "Analysis ToolPak" in
the pop-up window, then click the "OK" button. After doing this, you might need to re-start Excel to activate the
Add-In.)
The OLSdata.xls Dataset
This handout will use the OLSdata.xls dataset as an example. Go to the Handouts page of the ECN377 website
and download the OLSdata.xls dataset to the ECN377 folder on the C: drive of your computer. The OLSdata.xls
dataset contains 26 observations on 3 variables. The three variables are named Y, X1 and X2, and the variable
names are in the first row of the dataset.
Open the OLSdata.xls Dataset in Excel
The first four rows should look like this (the variable names are on the first row):
Conducting an OLS Regression Analysis for One X Variable—Simple OLS Regression
With the OLSdata.xls dataset open in Excel:
• Select the Data tab at the top of the Excel window.
• Select "Data Analysis" (on the right). (Note: If you don't have a "Data Analysis" button, then you
need to add the Analysis Toolpak Add-In to Excel. See above.)
• Select "Regression" and click "OK." You will see the "Regression" pop-up window in Excel.
• As an example, suppose we want to conduct a simple OLS regression analysis of variable Y against
variable X1. Click inside the "Input Y Range" box, and then select cells A1 to A27. This tells Excel
which data to use for the Y variable in the regression analysis.
• Click inside the "Input X Range" box, and then select cells B1 to B27. This tells Excel which data to use
for the X variable in the regression analysis.
• Check the box "Labels," because the labels for our variables are in the first row of the spreadsheet. (If
the first row of the spreadsheet did not contain the variable names, but instead simply gave the first row
of data, then we would not check the "Labels" box.)
• By default, Excel will use a confidence level of 95% for all hypothesis tests related to the regression. If
you want to change the confidence level, you can do so by checking the "Confidence Level" box and
entering the confidence level that you want. (For this example, let's use the default 95%, so don't check
the box.)
• Check the button beside "Output Range," and then click inside the box to the right of "Output Range."
Then, click on an empty cell in the spreadsheet, say, cell E2. The cell needs to have other blank cells
below it and to the right of it. This is the area on the spreadsheet where Excel will put the results.
• Click the "OK" button.
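Excel's Regression tool is computing ordinary least-squares estimates. As a cross-check on what the tool is doing, the slope and intercept of a simple regression can be reproduced by hand. Here is a minimal sketch in Python (Python is not part of the handout's Excel workflow, and the numbers below are made up for illustration, not the OLSdata.xls values):

```python
# Simple OLS by hand: slope = cov(X, Y) / var(X), intercept = mean(Y) - slope * mean(X).
# The data below are illustrative, not the values in OLSdata.xls.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                    # beta-hat for X
intercept = mean_y - slope * mean_x  # beta-hat for the intercept
```

Running Excel's Regression tool on the same two columns would report these same two numbers in its "Coefficients" column.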
Results

Excel puts the results of the regression analysis on the spreadsheet, starting in cell E2, and then working
down and to the right of cell E2. The results should look like this:

First, look at the "ANOVA" section of the results (in the middle of the results). This gives the
information used in the F-test of the regression model. The "Regression" row of the ANOVA section
gives RSS, and the "Residual" row gives ESS ("Residual" is another name for the estimated errors, the
ê_i's). The "F" number is the F-test statistic for the F-test of the statistical significance of the regression
model as a whole (you must find the F-critical number from an F table, or use Excel to calculate F-critical on
the side). The "Significance F" number is the p-value for the F-test. In this example, the p-value is
0.0000000112, much less than α = 0.05, so the regression as a whole is statistically significant (that is,
one or more of the β̂'s in the regression equation is not zero, so one or more of the X's in the regression
equation does have a statistically significant effect on Y).
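The "F" number itself is just a ratio of mean squares built from the ANOVA rows: F = (RSS / d.f. regression) / (ESS / d.f. residual). A sketch of the arithmetic in Python, with placeholder sums of squares rather than the values in the handout's output:

```python
# F = (RSS / df_regression) / (ESS / df_residual)
# Illustrative numbers, not the values from the handout's output.
rss = 1546.7   # "Regression" sum of squares
ess = 516.0    # "Residual" (error) sum of squares
n = 26         # observations
num_x = 1      # one X variable in the simple regression

df_regression = num_x
df_residual = n - num_x - 1  # n - k, where k counts the intercept too (26 - 2 = 24)

f_stat = (rss / df_regression) / (ess / df_residual)
```

Excel reports this ratio in the "F" column of the ANOVA table and its p-value under "Significance F."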
Next, look at the "Regression Statistics" section of the results (at the top of the results). This section of
the results gives the Multiple Correlation Coefficient, which Excel calls "Multiple R." Multiple R is
the correlation between the actual data points (the Y's) and the points on the estimated regression
line/curve (the Y-hats). Multiple R ranges from 0 to +1, and the value of 0.8658 indicates that there is a
strong correlation between the data points and the regression line/curve in this example. The coefficient
of determination, R², for the regression (labeled "R-Square" in Excel), and the adjusted R² (labeled
"Adjusted R Square") are also presented. The value of R² is 0.7496, indicating that the regression
equation explains 74.96% of the variation in the Y variable. The section also gives the Standard Error
of the Regression (SER), which is labeled simply "Standard Error" in Excel. The SER of 4.6368
indicates that, on average, a data point is 4.6368 units away from the regression line (in the Y direction).
"Observations" gives the sample size "n" used in the regression; here n = 26.
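All of these Regression Statistics can be recovered from the fitted values and residuals. A Python sketch with illustrative numbers (not the OLSdata.xls output):

```python
import math

# Actual Y values and fitted Y-hats from some regression (illustrative numbers only).
y     = [10.0, 12.0, 15.0, 21.0]
y_hat = [10.5, 12.5, 14.0, 21.0]
n = len(y)
k = 2  # parameters estimated: intercept plus one X

mean_y = sum(y) / n
tss = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
ess = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual ("error") sum of squares

r_squared = 1 - ess / tss       # Excel's "R Square"
ser = math.sqrt(ess / (n - k))  # Excel's "Standard Error" (the SER)
```

The SER divides the residual sum of squares by n - k, not n, which is why Excel's "Standard Error" is slightly larger than the raw standard deviation of the residuals.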
Finally, look at the bottom section of the results. The "Coefficient" column gives the estimates of the β̂'s
for the regression equation. The "Standard Error" column gives the s.e.(β̂)'s for the β̂'s. The "t Stat"
column gives the t-test number for each β̂ for the hypothesis test H0: β̂ = 0, H1: β̂ ≠ 0. (You must look
up the t-critical number in a t-table, or use Excel to calculate t-critical on the side.) Because the t-test value (8.48)
for X1 is farther from zero than t-critical = 2.064 (d.f. = n - k = 24, α/2 = 0.025), we reject H0 and conclude
that β̂ ≠ 0. Because β̂ ≠ 0, we conclude that X1 has a statistically significant effect on Y. The "P-value"
column gives the p-value for each β̂ for the hypothesis test H0: β̂ = 0, H1: β̂ ≠ 0. Because the p-value
is much less than α/2 = 0.025, we conclude that β̂ ≠ 0. Because β̂ ≠ 0, we conclude that X1 has a
statistically significant effect on Y. Notice that we get the same result for the hypothesis test whether we
compare the t-test value against t-critical, or we compare the p-value with α/2. The "Lower 95%" and "Upper 95%"
columns give the lower and upper confidence interval numbers for each β̂, based on the numbers in the
"Coefficient" column and the "Standard Error" column. Because the confidence interval does not include
zero, we reject H0 and conclude that β̂ ≠ 0. Because β̂ ≠ 0, we conclude that X1 has a statistically
significant effect on Y.
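The decision rules above (t-test, p-value, confidence interval) all come from the same two numbers in the output: the coefficient and its standard error. A Python sketch of the t-test and confidence-interval arithmetic; the t-critical value 2.064 for d.f. = 24 is from a t-table as in the handout, while the coefficient and standard error are illustrative placeholders:

```python
coef = 1.05    # illustrative beta-hat for X1 (not the handout's value)
se = 0.124     # illustrative standard error
t_crit = 2.064 # from a t-table: d.f. = 24, alpha/2 = 0.025

t_stat = coef / se                # Excel's "t Stat" column
reject_h0 = abs(t_stat) > t_crit  # two-sided test of H0: beta = 0

# 95% confidence interval: Excel's "Lower 95%" / "Upper 95%" columns
lower = coef - t_crit * se
upper = coef + t_crit * se
ci_excludes_zero = not (lower <= 0 <= upper)
# The two rules agree: we reject H0 exactly when the interval excludes zero.
```

This is why the t-test decision and the confidence-interval decision can never disagree for the same confidence level.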
Conducting a Regression Analysis for More than One X Variable—Multiple OLS Regression
• Now we will conduct a Multiple OLS Regression Analysis (a regression that uses more than one X
variable). As before, begin by selecting the Data tab at the top of the Excel window.
• Select "Data Analysis" (on the right). (Note: If you don't have a "Data Analysis" button, then you need to
add the Analysis Toolpak Add-In to Excel. See above.)
• Select "Regression" and click "OK." You will see the "Regression" pop-up window in Excel.
• Suppose we want to conduct a multiple OLS regression analysis of variable Y against variables X1 and X2.
Click inside the "Input Y Range" box, and then select cells A1 to A27. This tells Excel which data to use
for the Y variable in the regression analysis.
• Click inside the "Input X Range" box, and then select cells B1 to C27. This tells Excel which data to use
for the X variables in the regression analysis.
• Check the box "Labels," because the labels for our variables are in the first row of the spreadsheet. (If the
first row of the spreadsheet did not contain the variable names, but instead simply gave the first row of
data, then we would not check the "Labels" box.)
• By default, Excel will use a confidence level of 95% for all hypothesis tests related to the regression. If
you want to change the confidence level, you can do so by checking the "Confidence Level" box and
entering the confidence level that you want. (For this example, let's use the default 95%, so don't check
the box.)
• Check the button beside "Output Range," and then click inside the box to the right of "Output Range."
Then, click on an empty cell in the spreadsheet, say, cell E24. The cell needs to have other blank cells
below it and to the right of it. This is the area on the spreadsheet where Excel will put the results.
• Click the "OK" button.
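With more than one X variable, Excel is solving the least-squares normal equations, (X'X)b = X'y, where X is the data matrix with a column of 1's for the intercept. A minimal pure-Python sketch with made-up data (not the OLSdata.xls values); here y was generated exactly as y = 1 + 2·x1 + 0.5·x2, so the regression should recover those coefficients:

```python
# Multiple OLS via the normal equations (X'X) b = (X'y),
# solved with a small Gauss-Jordan elimination. Illustrative data only:
# y was constructed as y = 1 + 2*x1 + 0.5*x2 exactly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y  = [4.0, 5.5, 9.0, 10.5, 14.0, 15.5]

# Design matrix with an intercept column of 1's
X = [[1.0, a, b] for a, b in zip(x1, x2)]

def transpose_times(A, B):
    """Return A' * B, where A and B are lists of rows with the same row count."""
    return [[sum(A[r][i] * B[r][j] for r in range(len(A)))
             for j in range(len(B[0]))] for i in range(len(A[0]))]

xtx = transpose_times(X, X)                  # X'X, a 3x3 matrix
xty = transpose_times(X, [[v] for v in y])   # X'y, a 3x1 column

# Gauss-Jordan elimination with partial pivoting on the augmented matrix [X'X | X'y]
m = [row + rhs for row, rhs in zip(xtx, xty)]
p = len(m)
for col in range(p):
    pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
    m[col], m[pivot] = m[pivot], m[col]
    for r in range(p):
        if r != col:
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]

betas = [m[i][p] / m[i][i] for i in range(p)]  # [intercept, b for x1, b for x2]
```

These are the same numbers Excel would report in the "Coefficients" column of the multiple-regression output.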
Results

Excel puts the results of the regression analysis on the spreadsheet, starting in cell E24, and then working
down and to the right of cell E24. The results should look like this:
First, look at the "ANOVA" section of the results (in the middle of the results). This gives the
information used in the F-test of the regression model. The "Regression" row of the ANOVA section
gives RSS, and the "Residual" row gives ESS ("Residual" is another name for the estimated errors, the
ê_i's). The "F" number is the F-test statistic for the F-test of the statistical significance of the regression
model as a whole (you must find the F-critical number from an F table, or use Excel to calculate F-critical on
the side). The "Significance F" number is the p-value for the F-test. In this example, the p-value is
0.00000010588, much less than 0.05, so the regression as a whole is statistically significant (that is, one
or more of the β̂'s in the regression equation is not zero, so one or more of the X's in the regression
equation does have a statistically significant effect on Y).
Next, look at the "Regression Statistics" section of the results (at the top of the results). This section of
the results gives the Multiple Correlation Coefficient, R (labeled "Multiple R" in Excel), which
measures the linear correlation between the Y values of the data points and the Y values of the regression
line/curve (which depend on all the X variables in the regression equation). Unlike the Pearson
correlation coefficient (r), which can be either positive or negative, R can only be positive, and ranges
from 0 to +1. The closer R is to +1, the better the regression line/curve fits the data points. The value R
= 0.8675 in this example indicates that there is a strong correlation between the data points and the
regression line. Excel also provides the Coefficient of Determination, R² (labeled "R-Square" in
Excel), and the Adjusted R² (labeled "Adjusted R Square"). Recall that we should use Adjusted R²,
rather than R², when the regression equation has more than one X variable. In this example, the value of
Adjusted R² is 0.731048, indicating that the regression equation explains 73.10% of the variation in the
Y variable. Excel also presents the Standard Error of the Regression (SER) (labeled "Standard Error"
in Excel), which measures the variation of the data points around the regression line. The SER of 4.7087
indicates that, on average, a data point is 4.7087 units away from the regression line (in the vertical, Y
direction). "Observations" gives the sample size "n" used in the regression; here n = 26.
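Adjusted R² penalizes R² for each additional X variable, using the standard adjustment formula 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of X variables. Excel's value can be reproduced (up to rounding of Multiple R) from the numbers above:

```python
r_multiple = 0.8675  # "Multiple R" from the handout's output (rounded)
n = 26               # observations
p = 2                # number of X variables (X1 and X2)

r_squared = r_multiple ** 2
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
# adj_r_squared comes out near Excel's "Adjusted R Square" of 0.731048;
# the small difference is due to rounding Multiple R to four decimal places.
```

Because the penalty grows with p, Adjusted R² can fall when an unhelpful X variable is added, even though plain R² never decreases.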
Finally, look at the bottom section of the results. The "Coefficient" column gives the estimates of the β̂'s
for the regression equation. The "Standard Error" column gives the s.e.(β̂)'s for the β̂'s. The "t Stat"
column gives the t-test number for each β̂ for the hypothesis test H0: β̂ = 0, H1: β̂ ≠ 0. (You must look
up the t-critical number in a t-table, or use Excel to calculate t-critical on the side.)
o Because the t-test value (6.28) for X1 is farther from zero than t-critical = 2.069 (d.f. = n - k = 23, α/2
= 0.025), we reject H0 and conclude that β̂1 ≠ 0. Because β̂1 ≠ 0, we conclude that X1 has a
statistically significant effect on Y. The "P-value" column gives the p-value for each β̂ for the
hypothesis test H0: β̂ = 0, H1: β̂ ≠ 0. Because the p-value for β̂1 is much less than α/2 = 0.025,
we conclude that β̂1 ≠ 0. Because β̂1 ≠ 0, we conclude that X1 has a statistically significant
effect on Y. Notice that we get the same result for the hypothesis test whether we compare the t-test value
against t-critical, or we compare the p-value with α/2. The "Lower 95%" and "Upper 95%" columns give the
lower and upper confidence interval numbers for each β̂, based on the numbers in the
"Coefficient" column and the "Standard Error" column. Because the confidence interval does
not include zero, we reject H0 and conclude that β̂1 ≠ 0. Because β̂1 ≠ 0, we conclude that X1
has a statistically significant effect on Y.
o Because the t-test value (0.5224) for X2 is not farther from zero than t-critical = 2.069 (d.f. = n - k =
23, α/2 = 0.025), we don't reject H0, so we conclude that β̂2 = 0. Because β̂2 = 0, we conclude
that X2 does not have a statistically significant effect on Y. Because the p-value for β̂2, which is
equal to 0.6064, is greater than α/2 = 0.025, we conclude that β̂2 = 0. Because β̂2 = 0, we
conclude that X2 does not have a statistically significant effect on Y. Notice that we get the
same result for the hypothesis test whether we compare the t-test value against t-critical, or we compare the
p-value with α/2. Because the confidence interval does include zero, we don't reject H0 and
conclude that β̂2 = 0. Because β̂2 = 0, we conclude that X2 does not have a statistically
significant effect on Y.
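For X2 the same arithmetic runs the other way: with a small t statistic, the 95% interval straddles zero. A sketch using the t-test value 0.5224 reported above; the standard error is an arbitrary placeholder, since only the ratio of coefficient to standard error matters for the decision:

```python
t_stat = 0.5224  # Excel's "t Stat" for X2, from the handout's output
t_crit = 2.069   # t-table value: d.f. = 23, alpha/2 = 0.025

reject_h0 = abs(t_stat) > t_crit  # False: X2 is not statistically significant

# The confidence-interval rule is the same condition in disguise:
# coef - t_crit*se < 0 < coef + t_crit*se  holds exactly when  |coef/se| < t_crit
se = 1.0              # arbitrary positive placeholder; only coef/se matters
coef = t_stat * se
lower = coef - t_crit * se
upper = coef + t_crit * se
ci_includes_zero = lower <= 0 <= upper  # True, agreeing with the t-test
```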