UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas

Excel--Correlation Analysis

Recall that Correlation Analysis measures the strength/degree of linear relationship (if any) between two quantitative measurement variables. This handout explains how to conduct correlation analysis in Microsoft Excel 2013.

NOTE: This handout assumes that the Analysis ToolPak "Add-In" has been activated in Excel. An "Add-In" is an extra feature of Excel that is not active by default, so you must activate it. (To activate the Analysis ToolPak "Add-In" in Excel, start Excel, go to the File tab at the top of the Excel window, select "Options" on the left, and then select "Add-Ins" on the left. Next, at the bottom of the window, in the box to the right of "Manage", select "Excel Add-ins" and click the "Go" button. Check the box beside "Analysis ToolPak" in the pop-up window, then click the "OK" button. After doing this, you might need to restart Excel to activate the Add-In.)

The ProcCorrData.xls Dataset

This handout will use the ProcCorrData.xls dataset as an example. Go to the Handouts page of the ECN377 website and download the ProcCorrData.xls dataset to the ECN377 folder on the C: drive of your computer.
The ProcCorrData.xls dataset contains data on 9 variables for a random sample of 45 North Carolina counties (out of the total population of 100 North Carolina counties), as described in the table below:

Variable Name    Variable Definition
CntyName         Name of county in North Carolina
PopCens          Population in county in year 2000
LandArea         Land area (square miles) in county in year 2000
PM10Area         Air pollution index (estimated emissions in tons of air pollution particles less than 10 micrometers in size) for county in year 2000
HousingUnits     Total of houses, apartments, mobile homes, etc., in county in year 2000
EmpManf2000      Manufacturing employment in county in year 2000
VehRegs          Number of cars and trucks registered (owned and located) in county in year 2000
PavedMiles       Number of miles of paved roads in county in year 2000
MeanFamInc       Average (mean) household income in county in year 2000

Open the ProcCorrData.xls Dataset in Excel

The first four rows should look like this:

[Screenshot: first four rows of the spreadsheet, with the variable names in row 1.]

Conducting a Correlation Analysis on Two Variables

With the ProcCorrData.xls dataset open in Excel:

- Select the Data tab at the top of the Excel window.
- Select "Data Analysis" (on the right). (Note: If you don't have a "Data Analysis" button, then you need to add the Analysis ToolPak Add-In to Excel. See above.)
- Select "Correlation" and click "OK." You will see the "Correlation" pop-up window.
- Click inside the "Input Range" box, and then select cells B1 to C46. This tells Excel which data to use for the correlation analysis. In this case, we are conducting a correlation analysis for variables PopCens and LandArea.

NOTE: The two variables that you want to analyze need to be in adjacent columns. If they are not, move the columns so that the two variables are next to each other.
- Check the box "Labels in First Row," because the labels for our variables are in the first row of the spreadsheet. (If the first row of the spreadsheet did not contain the variable names, but instead simply gave the first row of data, then we would not check the "Labels in First Row" box.)
- Check the button beside "Output Range," and then click inside the box to the right of "Output Range." Then, click on an empty cell in the spreadsheet, say, cell K2. The cell needs to have other blank cells below it and to the right of it. This is the area on the spreadsheet where Excel will put the results.
- Click the "OK" button.

Results

Excel puts the results of the correlation analysis on the spreadsheet, starting in cell K2, and then working down and to the right of cell K2. The results are presented in the form of a Correlation Matrix. (See the Correlation handout for a description of a Correlation Matrix.) In this example, the correlation between variables PopCens and LandArea is r = 0.292234. Excel gives you the results for only the lower half of the Correlation Matrix, because the numbers in the upper half of the matrix are the same as the numbers in the lower half.

Notice that, unlike SAS, Excel does not give you the p-value (or the t-test value) for the hypothesis test H0: ρ=0 vs. H1: ρ≠0. So, if you accept the r value that Excel produces, you are assuming that H1: ρ≠0 is true, without testing it. That can be risky. If you want, you can conduct the hypothesis test H0: ρ=0 vs. H1: ρ≠0 on the side, on scratch paper, using the r value produced by Excel in the t-test formula

$t_{test} = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$

as described in the Correlation handout. You would need to compare the $t_{test}$ number with a $t_{critical}$ number from a t-table, using the significance level (α) you want, and using degrees of freedom d.f. = n – 2. For this two-sided test, reject H0 if the absolute value of $t_{test}$ is larger than $t_{critical}$.
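If you would rather not do the scratch-paper calculation by hand, the t-test formula above can be sketched in a few lines of Python. This is only an illustration, not part of the Excel procedure: the function name correlation_t_test is hypothetical, n = 45 is taken from the number of counties in the sample, and the t_critical value of about 2.017 (two-sided test, α = 0.05, d.f. = 43) is an assumption that you should check against your own t-table.

```python
import math


def correlation_t_test(r, n):
    """Return (t_test, degrees_of_freedom) for H0: rho = 0 vs. H1: rho != 0.

    Hypothetical helper name; implements t_test = r*sqrt(n-2) / sqrt(1-r^2).
    """
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    df = n - 2
    t_test = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t_test, df


# r value reported by Excel for PopCens and LandArea; n = 45 sampled counties.
r = 0.292234
n = 45
t_test, df = correlation_t_test(r, n)

# ASSUMPTION: approximate two-sided critical value for alpha = 0.05, d.f. = 43.
# Look up the exact value in your own t-table before relying on it.
t_critical = 2.017

# Two-sided test: reject H0 only if |t_test| exceeds t_critical.
reject_h0 = abs(t_test) > t_critical

print(f"t_test = {t_test:.3f}, d.f. = {df}, reject H0: {reject_h0}")
```

With these inputs, t_test comes out at roughly 2.00, which is slightly below the assumed t_critical, so under that assumption H0: ρ=0 would not be rejected at the 5% significance level. A different α changes t_critical and may change the conclusion.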
Conducting a Correlation Analysis for More than Two Variables

Back in Excel, follow the same steps as described above for the analysis of two variables, except select cells B1 to I46 instead of cells B1 to C46. The results for all of the selected variables are presented in a Correlation Matrix. Again, only the results for the lower half of the matrix are presented, because the results in the upper half of the matrix are the same as the results in the lower half.
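To see why only the lower half of a Correlation Matrix is needed, here is a minimal Python sketch that builds the same kind of lower-triangular layout. It does not use the ProcCorrData.xls data: the three columns A, B, and C are made-up toy numbers, and the function names pearson_r and lower_correlation_matrix are hypothetical, chosen only for this illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lower_correlation_matrix(columns):
    """Return (names, rows), where rows[i] holds r for variable i paired
    with variables 0..i, i.e. the lower half of the matrix plus the diagonal."""
    names = list(columns)
    rows = []
    for i, row_name in enumerate(names):
        rows.append([pearson_r(columns[row_name], columns[col_name])
                     for col_name in names[:i + 1]])
    return names, rows


# TOY DATA (hypothetical, not from ProcCorrData.xls).
toy = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 3, 4, 1, 2],
}

names, rows = lower_correlation_matrix(toy)

# Print one row per variable; the upper half is left blank.
for name, row in zip(names, rows):
    print(name.ljust(4), "  ".join(f"{value:7.4f}" for value in row))
```

Each row lists the correlations of one variable with the variables before it, ending with the 1 on the diagonal (every variable is perfectly correlated with itself). The upper half is omitted because the correlation of A with B is the same number as the correlation of B with A.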