UNC-Wilmington Department of Economics and Finance
ECN 377
Dr. Chris Dumas

Excel—OLS Regression Analysis

Recall that regression analysis is a method for using data to find the best estimates of the possible relationships among variables. Typically, we are interested in explaining, forecasting, or predicting the behavior of one or more variables, called the dependent (or "Y") variables, based on the behavior of one or more independent (or "X") variables. This handout explains how to conduct ordinary least-squares (OLS) regression analysis (both simple and multiple regression analysis) in Microsoft Excel 2013.

NOTE: This handout assumes that the Analysis ToolPak "Add-In" has been activated in Excel. An Add-In is an extra feature of Excel that is not active by default, so you must activate it. To activate the Analysis ToolPak: start Excel, go to the File tab at the top of the Excel window, select "Options" on the left, and then "Add-Ins" on the left. Next, at the bottom of the window, in the box to the right of "Manage," select "Excel Add-ins" and click the "Go" button. Check the box beside "Analysis ToolPak" in the pop-up window and click the "OK" button. After doing this, you might need to re-start Excel to activate the Add-In.

The OLSdata.xls Dataset

This handout uses the OLSdata.xls dataset as an example. Go to the Handouts page of the ECN377 website and download the OLSdata.xls dataset to the ECN377 folder on the C: drive of your computer. The OLSdata.xls dataset contains 26 observations on 3 variables. The three variables are named Y, X1 and X2, and the variable names are in the first row of the dataset.
Open the OLSdata.xls Dataset in Excel

The first four rows should look like this (the variable names are on the first row):

Conducting an OLS Regression Analysis for One X Variable—Simple OLS Regression

With the OLSdata.xls dataset open in Excel:

- Select the Data tab at the top of the Excel window.
- Select "Data Analysis" (on the right). (Note: If you don't have a "Data Analysis" button, then you need to add the Analysis ToolPak Add-In to Excel. See above.)
- Select "Regression" and click "OK." You will see the "Regression" pop-up window in Excel.

As an example, suppose we want to conduct a simple OLS regression analysis of variable Y against variable X1.

- Click inside the "Input Y Range" box, and then select cells A1 to A27. This tells Excel which data to use for the Y variable in the regression analysis.
- Click inside the "Input X Range" box, and then select cells B1 to B27. This tells Excel which data to use for the X variable in the regression analysis.
- Check the "Labels" box, because the labels for our variables are in the first row of the spreadsheet. (If the first row of the spreadsheet contained the first row of data rather than the variable names, we would not check the "Labels" box.)
- By default, Excel uses a confidence level of 95% for all hypothesis tests related to the regression. If you want a different confidence level, check the "Confidence Level" box and enter the confidence level that you want. (For this example, let's use the default 95%, so don't check the box.)
- Check the button beside "Output Range," and then click inside the box to the right of "Output Range." Then click on an empty cell in the spreadsheet, say, cell E2. The cell needs to have blank cells below it and to its right; this is the area of the spreadsheet where Excel will put the results.
- Click the "OK" button.
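The steps above ask Excel to fit the line by ordinary least squares. For readers who want to check the mechanics outside Excel, here is a minimal Python sketch (numpy assumed available) of the same calculation; the six data points are made-up illustration values, not the OLSdata.xls data:

```python
import numpy as np

# Made-up illustration data (NOT the OLSdata.xls values):
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Design matrix [1, X1]: the column of ones produces the "Intercept" row
# that Excel reports alongside the row for X1.
X = np.column_stack([np.ones_like(X1), X1])

# Ordinary least squares: choose b to minimize the sum of squared errors
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
intercept, slope = b
print(f"Intercept = {intercept:.4f}, X1 coefficient = {slope:.4f}")
```

These two numbers correspond to Excel's "Coefficients" column for the Intercept and X1 rows.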
Results

Excel puts the results of the regression analysis on the spreadsheet, starting in cell E2 and working down and to the right of cell E2. The results should look like this:

First, look at the "ANOVA" section of the results (in the middle of the results). This gives the information used in the F-test of the regression model. The "Regression" row of the ANOVA section gives RSS, and the "Residual" row gives ESS ("Residual" is another name for the estimated errors, the ê_i's). The "F" number is the Ftest number for the F-test of the statistical significance of the regression model as a whole (you must find the Fcritical number from an F table, or use Excel to calculate Fcritical on the side). The "Significance F" number is the p-value for the F-test. In this example, the p-value is 0.0000000112, much less than α = 0.05, so the regression as a whole is statistically significant (that is, one or more of the β̂'s in the regression equation is not zero, so one or more of the X's in the regression equation does have a statistically significant effect on Y).

Next, look at the "Regression Statistics" section of the results (at the top of the results). This section gives the Multiple Correlation Coefficient, which Excel calls "Multiple R." Multiple R is the correlation between the actual data points (the Y's) and the points on the estimated regression line/curve (the Ŷ's, or "Y-hats"). Multiple R ranges from 0 to +1, and the value of 0.8658 indicates that there is a strong correlation between the data points and the regression line/curve in this example. The coefficient of determination, R², for the regression (labeled "R Square" in Excel) and the adjusted R² (labeled "Adjusted R Square") are also presented. The value of R² is 0.7496, indicating that the regression equation explains 74.96% of the variation in the Y variable. The section also gives the Standard Error of the Regression (SER), which is labeled simply "Standard Error" in Excel.
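All of the ANOVA and Regression Statistics quantities above can be reproduced from the fitted values. Below is a sketch in Python (numpy and scipy assumed available) using made-up illustration data, not the handout's output; it follows the handout's naming, where RSS is the "Regression" sum of squares and ESS is the "Residual" sum of squares:

```python
import numpy as np
from scipy import stats

# Made-up illustration data (NOT the OLSdata.xls values)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
n, k = len(Y), 2  # k = number of estimated coefficients (intercept + X1)

X = np.column_stack([np.ones_like(X1), X1])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ b  # points on the estimated regression line

TSS = np.sum((Y - Y.mean()) ** 2)      # total variation in Y
RSS = np.sum((Y_hat - Y.mean()) ** 2)  # ANOVA "Regression" row (handout's RSS)
ESS = np.sum((Y - Y_hat) ** 2)         # ANOVA "Residual" row (handout's ESS)

R2 = RSS / TSS                              # "R Square"
multiple_R = np.sqrt(R2)                    # "Multiple R"
SER = np.sqrt(ESS / (n - k))                # "Standard Error" of the regression
F_test = (RSS / (k - 1)) / (ESS / (n - k))  # "F"
F_crit = stats.f.ppf(0.95, k - 1, n - k)    # F-critical at alpha = 0.05
p_val = stats.f.sf(F_test, k - 1, n - k)    # "Significance F"
print(R2, SER, F_test, F_crit, p_val)
```

Note that RSS + ESS = TSS for an OLS regression with an intercept, which is why R² can be read as the fraction of variation explained.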
The SER of 4.6368 indicates that, on average, a data point is 4.6368 units away from the regression line (in the Y direction). "Observations" gives the sample size "n" used in the regression; here n = 26.

Finally, look at the bottom section of the results. The "Coefficients" column gives the estimates of the β̂'s for the regression equation. The "Standard Error" column gives the s.e.'s (standard errors) of the β̂'s. The "t Stat" column gives the ttest number for each β̂ for the hypothesis test H0: β = 0, H1: β ≠ 0. (You must look up the tcritical number in a t-table, or use Excel to calculate tcritical on the side.) Because the ttest value (8.48) for X1 is farther from zero than tcritical = 2.064 (d.f. = n − k = 24, α/2 = 0.025 in each tail), we reject H0 and conclude that β ≠ 0. Because β ≠ 0, we conclude that X1 has a statistically significant effect on Y. The "P-value" column gives the (two-tailed) p-value for each β̂ for the same hypothesis test. Because the p-value is much less than α = 0.05, we again conclude that β ≠ 0, and therefore that X1 has a statistically significant effect on Y. Notice that we get the same result for the hypothesis test whether we compare ttest against tcritical, or we compare the p-value with α. The "Lower 95%" and "Upper 95%" columns give the lower and upper confidence interval bounds for each β̂, based on the numbers in the "Coefficients" column and the "Standard Error" column. Because the confidence interval for X1 does not include zero, we reject H0 and conclude that β ≠ 0; again, X1 has a statistically significant effect on Y.

Conducting a Regression Analysis for More than One X Variable—Multiple OLS Regression

Now we will conduct a multiple OLS regression analysis (a regression that uses more than one X variable). As before, begin by selecting the Data tab at the top of the Excel window, then select "Data Analysis" (on the right).
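The tcritical = 2.064 used above can be calculated "on the side" rather than looked up in a table. A short Python sketch (scipy assumed available; the conf_int helper below is a hypothetical name for illustration, not an Excel or scipy function):

```python
from scipy import stats

# Two-tailed t-critical for the simple regression:
# d.f. = n - k = 26 - 2 = 24, alpha = 0.05 (0.025 in each tail).
t_crit = stats.t.ppf(1 - 0.05 / 2, df=24)
print(round(t_crit, 3))  # prints 2.064, matching the handout

# Excel's "Lower 95%" / "Upper 95%" columns are estimate +/- t_crit * s.e.
# conf_int is a hypothetical helper name used here for illustration.
def conf_int(estimate, std_err, t_critical):
    return (estimate - t_critical * std_err, estimate + t_critical * std_err)
```

In Excel itself, the same critical value comes from a worksheet formula such as =T.INV.2T(0.05, 24).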
(Note: If you don't have a "Data Analysis" button, then you need to add the Analysis ToolPak Add-In to Excel. See above.)

- Select "Regression" and click "OK." You will see the "Regression" pop-up window in Excel.

Now we want to conduct a multiple OLS regression analysis of variable Y against variables X1 and X2.

- Click inside the "Input Y Range" box, and then select cells A1 to A27. This tells Excel which data to use for the Y variable in the regression analysis.
- Click inside the "Input X Range" box, and then select cells B1 to C27. This tells Excel which data to use for the X variables in the regression analysis.
- Check the "Labels" box, because the labels for our variables are in the first row of the spreadsheet. (If the first row of the spreadsheet contained the first row of data rather than the variable names, we would not check the "Labels" box.)
- By default, Excel uses a confidence level of 95% for all hypothesis tests related to the regression. If you want a different confidence level, check the "Confidence Level" box and enter the confidence level that you want. (For this example, let's use the default 95%, so don't check the box.)
- Check the button beside "Output Range," and then click inside the box to the right of "Output Range." Then click on an empty cell in the spreadsheet, say, cell E24. The cell needs to have blank cells below it and to its right; this is the area of the spreadsheet where Excel will put the results.
- Click the "OK" button.

Results

Excel puts the results of the regression analysis on the spreadsheet, starting in cell E24 and working down and to the right of cell E24. The results should look like this:

First, look at the "ANOVA" section of the results (in the middle of the results). This gives the information used in the F-test of the regression model.
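As with the simple regression, Excel's multiple-regression fit is ordinary least squares with one column per X variable plus an intercept. A minimal Python sketch (numpy assumed available), using made-up data in which Y is built to depend on X1 but not on X2:

```python
import numpy as np

# Made-up illustration data (NOT the OLSdata.xls values): 26 observations
# in which Y depends on X1 but not on X2.
rng = np.random.default_rng(377)
X1 = rng.uniform(0.0, 10.0, size=26)
X2 = rng.uniform(0.0, 10.0, size=26)
Y = 3.0 + 2.0 * X1 + rng.normal(0.0, 1.0, size=26)  # X2 omitted on purpose

# Design matrix: a column of ones for the intercept (Excel adds this
# automatically), then one column per X variable, as in "Input X Range" B1:C27.
X = np.column_stack([np.ones(26), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("estimates (intercept, X1, X2):", b)
```

With data generated this way, the estimated coefficient on X1 should land near 2 and the coefficient on X2 near 0, the same qualitative pattern the handout finds in OLSdata.xls.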
The "Regression" row of the ANOVA section gives RSS, and the "Residual" row gives ESS ("Residual" is another name for the estimated errors, the ê_i's). The "F" number is the Ftest number for the F-test of the statistical significance of the regression model as a whole (you must find the Fcritical number from an F table, or use Excel to calculate Fcritical on the side). The "Significance F" number is the p-value for the F-test. In this example, the p-value is 0.00000010588, much less than α = 0.05, so the regression as a whole is statistically significant (that is, one or more of the β̂'s in the regression equation is not zero, so one or more of the X's in the regression equation does have a statistically significant effect on Y).

Next, look at the "Regression Statistics" section of the results (at the top of the results). This section gives the Multiple Correlation Coefficient, R (labeled "Multiple R" in Excel), which measures the linear correlation between the Y values of the data points and the Y values of the regression line/curve (which depend on all the X variables in the regression equation). Unlike the Pearson correlation coefficient (r), which can be either positive or negative, R can only be positive, ranging from 0 to +1. The closer R is to +1, the better the regression line/curve fits the data points. The value R = 0.8675 in this example indicates that there is a strong correlation between the data points and the regression line. Excel also provides the Coefficient of Determination, R² (labeled "R Square" in Excel), and the Adjusted R² (labeled "Adjusted R Square"). Recall that we should use Adjusted R², rather than R², when the regression equation has more than one X variable. In this example, the value of Adjusted R² is 0.731048, indicating that the regression equation explains about 73.10% of the variation in the Y variable.
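The Adjusted R² above can be reproduced from Multiple R with the standard adjustment formula, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k). Plugging in the handout's own numbers (R = 0.8675, n = 26, k = 3 coefficients):

```python
# Reproduce the handout's Adjusted R-Square from its Multiple R.
R = 0.8675    # "Multiple R" from the Excel output
n, k = 26, 3  # 26 observations; 3 coefficients (intercept, X1, X2)

R2 = R ** 2                                # "R Square", about 0.7526
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k)  # "Adjusted R Square"
print(round(adj_R2, 3))  # prints 0.731, matching Excel's 0.731048
```

The adjustment penalizes R² for each extra coefficient, which is why Adjusted R² is the better fit measure when the regression has more than one X variable.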
Excel also presents the Standard Error of the Regression (SER) (labeled "Standard Error" in Excel), which measures the variation of the data points around the regression line. The SER of 4.7087 indicates that, on average, a data point is 4.7087 units away from the regression line (in the vertical, Y direction). "Observations" gives the sample size "n" used in the regression; here n = 26.

Finally, look at the bottom section of the results. The "Coefficients" column gives the estimates of the β̂'s for the regression equation. The "Standard Error" column gives the s.e.'s (standard errors) of the β̂'s. The "t Stat" column gives the ttest number for each β̂ for the hypothesis test H0: β = 0, H1: β ≠ 0. (You must look up the tcritical number in a t-table, or use Excel to calculate tcritical on the side.)

- Because the ttest value (6.28) for X1 is farther from zero than tcritical = 2.069 (d.f. = n − k = 23, α/2 = 0.025 in each tail), we reject H0 and conclude that β1 ≠ 0. Because β1 ≠ 0, we conclude that X1 has a statistically significant effect on Y. The "P-value" column gives the (two-tailed) p-value for each β̂ for the same hypothesis test. Because the p-value for β̂1 is much less than α = 0.05, we again conclude that β1 ≠ 0, and therefore that X1 has a statistically significant effect on Y. Notice that we get the same result for the hypothesis test whether we compare ttest against tcritical, or we compare the p-value with α. The "Lower 95%" and "Upper 95%" columns give the lower and upper confidence interval bounds for each β̂, based on the numbers in the "Coefficients" column and the "Standard Error" column. Because the confidence interval for X1 does not include zero, we reject H0 and conclude that β1 ≠ 0; again, X1 has a statistically significant effect on Y.

- Because the ttest value (0.5224) for X2 is not farther from zero than tcritical = 2.069 (d.f. = n − k = 23, α/2 = 0.025 in each tail), we do not reject H0, so we conclude that β2 = 0.
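Both numbers used in the X2 test can be computed on the side. A short Python check (scipy assumed available), starting from the t Stat that Excel reports for X2:

```python
from scipy import stats

# t-critical for the multiple regression: d.f. = n - k = 26 - 3 = 23.
t_crit = stats.t.ppf(1 - 0.05 / 2, df=23)
print(round(t_crit, 3))  # prints 2.069, matching the handout

# Two-tailed p-value implied by the reported "t Stat" for X2 (0.5224):
p_two_sided = 2 * stats.t.sf(0.5224, df=23)
print(p_two_sided)  # close to the 0.6064 Excel reports -- well above 0.05
```

Because the two-tailed p-value combines both tails, it is compared against α = 0.05 directly, which is equivalent to comparing |ttest| against the α/2 critical value.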
Because β2 = 0, we conclude that X2 does not have a statistically significant effect on Y. Likewise, because the p-value for β̂2, which is 0.6064, is greater than α = 0.05, we conclude that β2 = 0, and therefore that X2 does not have a statistically significant effect on Y. Notice that we get the same result for the hypothesis test whether we compare ttest against tcritical, or we compare the p-value with α. Finally, because the confidence interval for X2 does include zero, we do not reject H0 and conclude that β2 = 0. Because β2 = 0, we conclude that X2 does not have a statistically significant effect on Y.