STATISTICAL ANALYSIS USING EXCEL

By Dr. Wisuttorn Jitaree
Faculty of Business Administration
Chiang Mai University
Thailand

1. Understanding statistical techniques

Before discussing inferential statistics, it is necessary to introduce a few other concepts. The first issue we address is the shape of the data, which we examine through data frequency tables.

2. Data Frequency Tables

In this example we have obtained 30 completed and usable questionnaires from students from the School of Business and another 30 completed and usable questionnaires from students from the School of Accounting. We have 60 data points in all.

The 30 respondents from the School of Business supplied the following number of months working experience:

Table 2.1: School of Business, number of months working experience

    23    54
    28    55
    29    56
    34    56
    34    65
    39    65
    43    65
    44    67
    45    73
    45    76
    48    76
    48    77
    49    78
    54    87
    54    92

Respondents from the School of Accounting replied with the following data:

Table 2.2: School of Accounting, number of months working experience

    10    34
    12    34
    12    41
    16    43
    19    43
    20    44
    22    45
    23    45
    23    54
    23    56
    26    56
    28    56
    29    65
    32    67
    33    76

Excel has a function, =FREQUENCY(), that constructs data frequency tables. Its first argument is the range holding the data (here C3:C32) and its second argument is the range holding the upper limits of the intervals. The required data distribution has been entered into the range I6 through I11:

    Data distribution
    25
    36
    48
    60
    72
    >72

The formula is:

    =FREQUENCY(C3:C32,I6:I11)

Because FREQUENCY returns one count per interval, it is entered as an array formula: select the output range, press F2 to edit the formula, and then press Ctrl + Shift + Enter.

Table 2.3: School of Business Frequency Table of the number of students and the number of months working experience

    Months Experience    No. of Students
    Under 25                    1
    26-36                       4
    37-48                       7
    49-60                       7
    61-72                       4
    Above 72                    7
    Total                      30

Table 2.4: School of Business Frequency Table with relative frequency

    Months Experience    No. of Students    Relative frequency (%)
    Under 25                    1                  3.33
    26-36                       4                 13.33
    37-48                       7                 23.33
    49-60                       7                 23.33
    61-72                       4                 13.33
    Above 72                    7                 23.33
    Total                      30                100

Table 2.5: School of Business Frequency Table with relative frequency and cumulative relative frequency

    Months Experience    No. of Students    Relative frequency (%)    Cumulative relative frequency (%)
    Under 25                    1                  3.33                      3.33
    26-36                       4                 13.33                     16.67
    37-48                       7                 23.33                     40.00
    49-60                       7                 23.33                     63.33
    61-72                       4                 13.33                     76.67
    Above 72                    7                 23.33                    100.00
    Total                      30                100

These frequency tables may be plotted as histograms. Figure 2.1 shows the results for the School of Business and Figure 2.2 shows the results for the School of Accounting.

Figure 2.1: A histogram used to examine the shape of the data frequency table for the School of Business (vertical axis: No. of Students; horizontal axis: months experience, Under 25 to Above 72)

Figure 2.2: A histogram used to examine the shape of the data frequency table for the School of Accounting (vertical axis: No. of Students; horizontal axis: months experience, Under 25 to Above 72)

These two data sets can be plotted on the same axes using a line graph, as shown in Figure 2.3.

Figure 2.3: The School of Business and the School of Accounting data as traces on one graph (chart title: Number of students with differing periods of work experience from School of Business and School of Accounting; series: SOB and SOA)

3. Descriptive statistics

The Descriptive Statistics analysis tool generates a report of univariate statistics for data in the input range, providing information about the central tendency and variability of your data.

The results are shown in Figure 3.1.

    Column1
    Mean                        55.3
    Standard Error              3.256170094
    Median                      54
    Mode                        54
    Standard Deviation          17.83477811
    Sample Variance             318.0793103
    Kurtosis                    -0.61398127
    Skewness                    0.1850153
    Range                       69
    Minimum                     23
    Maximum                     92
    Sum                         1659
    Count                       30
    Confidence Level(95.0%)     6.659615595

Figure 3.1: The descriptive statistics of sample data

4. Covariance

The Correlation and Covariance tools can both be used in the same setting, when you have N different measurement variables observed on a set of individuals. The Correlation and Covariance tools each give an output table, a matrix that shows the correlation coefficient or covariance, respectively, between each pair of measurement variables. The difference is that correlation coefficients are scaled to lie between -1 and +1 inclusive. Corresponding covariances are not scaled. Both the correlation coefficient and the covariance are measures of the extent to which two variables "vary together."

The results are shown in Figure 4.1.

                Column 1       Column 2
    Column 1    307.4767
    Column 2    (not shown)    299.8456

5. Anova

The Anova analysis tools provide different types of variance analysis. The tool that you should use depends on the number of factors and the number of samples that you have from the populations that you want to test.

Anova: Single Factor

This tool performs a simple analysis of variance on data for two or more samples. The analysis provides a test of the hypothesis that each sample is drawn from the same underlying probability distribution against the alternative hypothesis that underlying probability distributions are not the same for all samples. If there are only two samples, you can use the worksheet function T.TEST. With more than two samples, there is no convenient generalization of T.TEST, and the Single Factor Anova model can be called upon instead.

The results are shown in Figure 5.1.

    Anova: Single Factor

    SUMMARY
    Groups      Count    Sum     Average     Variance
    Column 1    30       1659    55.3        318.0793
    Column 2    30       1087    36.23333    310.1851

    ANOVA
    Source of Variation    SS          df    MS          F           P-value     F crit
    Between Groups         5453.067    1     5453.067    17.35915    0.000104    4.006873
    Within Groups          18219.67    58    314.1322
    Total                  23672.73    59

6. Correlation

The correlation coefficient, like the covariance, is a measure of the extent to which two measurement variables "vary together."
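As a cross-check outside Excel, both measures can be computed in a few lines of Python. This is only a sketch under stated assumptions: the two samples are the values of Tables 2.1 and 2.2 read down the first column and then the second, they are paired row by row (an assumption about how the worksheet was laid out), and the covariance is the population form that divides by n. The names business, accounting, covariance and correlation are illustrative and do not come from the handout.

```python
from math import sqrt

# Months of working experience from Tables 2.1 and 2.2.
# Pairing the two lists row by row is an assumption made for illustration.
business = [23, 28, 29, 34, 34, 39, 43, 44, 45, 45, 48, 48, 49, 54, 54,
            54, 55, 56, 56, 65, 65, 65, 67, 73, 76, 76, 77, 78, 87, 92]
accounting = [10, 12, 12, 16, 19, 20, 22, 23, 23, 23, 26, 28, 29, 32, 33,
              34, 34, 41, 43, 43, 44, 45, 45, 54, 56, 56, 56, 65, 67, 76]


def covariance(xs, ys):
    """Population covariance: mean product of deviations from the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, +1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


print("covariance :", round(covariance(business, accounting), 4))
print("correlation:", round(correlation(business, accounting), 6))

# Converting one variable from months to years rescales the covariance
# but leaves the correlation coefficient unchanged.
business_years = [x / 12 for x in business]
print("covariance (years) :", round(covariance(business_years, accounting), 4))
print("correlation (years):", round(correlation(business_years, accounting), 6))
```

The covariance of a variable with itself is its population variance, which is why the same helper is reused for the denominator of the correlation.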
Unlike the covariance, the correlation coefficient is scaled so that its value is independent of the units in which the two measurement variables are expressed. (For example, if the two measurement variables are weight and height, the value of the correlation coefficient is unchanged if weight is converted from pounds to kilograms.) The value of any correlation coefficient must be between -1 and +1 inclusive.

The results are shown in Figure 6.1.

                Column 1    Column 2
    Column 1    1
    Column 2    0.986474    1

7. F-Test Two-Sample for Variances

The F-Test Two-Sample for Variances analysis tool performs a two-sample F-test to compare two population variances. For example, you can use the F-Test tool on samples of times in a swim meet for each of two teams. The tool provides the result of a test of the null hypothesis that these two samples come from distributions with equal variances, against the alternative that the variances are not equal in the underlying distributions.

The tool calculates the value f of an F-statistic (or F-ratio). A value of f close to 1 provides evidence that the underlying population variances are equal. In the output table, if f < 1, "P(F <= f) one-tail" gives the probability of observing a value of the F-statistic less than f when population variances are equal, and "F Critical one-tail" gives the critical value less than 1 for the chosen significance level, Alpha. If f > 1, "P(F <= f) one-tail" gives the probability of observing a value of the F-statistic greater than f when population variances are equal, and "F Critical one-tail" gives the critical value greater than 1 for Alpha.

The results are shown in Figure 7.1.

    F-Test Two-Sample for Variances

                           Variable 1     Variable 2
    Mean                   55.3           36.23333333
    Variance               318.0793103    310.1850575
    Observations           30             30
    df                     29             29
    F                      1.025450139
    P(F<=f) one-tail       0.473256022
    F Critical one-tail    1.860811435

8. Histogram

The Histogram analysis tool calculates individual and cumulative frequencies for a cell range of data and data bins.
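The counting that the Histogram tool and =FREQUENCY() perform can be sketched in Python as follows. This is an illustration rather than Excel's own implementation. It assumes the School of Business values from Table 2.1 and the bin upper limits 25, 36, 48, 60 and 72 implied by the classes of Table 2.3, and it treats each bin as holding the values that are less than or equal to its upper limit and greater than the previous limit. The function name frequency and the variable names are illustrative.

```python
from bisect import bisect_left

# School of Business months of working experience (Table 2.1).
business = [23, 28, 29, 34, 34, 39, 43, 44, 45, 45, 48, 48, 49, 54, 54,
            54, 55, 56, 56, 65, 65, 65, 67, 73, 76, 76, 77, 78, 87, 92]

# Upper limits of the bins, in ascending order (assumed from Table 2.3).
bin_limits = [25, 36, 48, 60, 72]


def frequency(data, limits):
    """Count values per bin; the extra last bin collects values above every limit."""
    counts = [0] * (len(limits) + 1)
    for value in data:
        # bisect_left finds the first limit that is >= value, so a value equal
        # to a limit is counted in that limit's bin.
        counts[bisect_left(limits, value)] += 1
    return counts


counts = frequency(business, bin_limits)
total = sum(counts)

# Cumulative relative frequency in percent, as in Table 2.5.
cumulative_pct = []
running = 0
for count in counts:
    running += count
    cumulative_pct.append(round(100 * running / total, 2))

labels = ["Under 25", "26-36", "37-48", "49-60", "61-72", "Above 72"]
for label, count, pct in zip(labels, counts, cumulative_pct):
    print(f"{label:<10}{count:>4}{pct:>9.2f}")
```

Changing bin_limits to other ascending values gives the counts for a different set of bins, in the same way that changing the bin range does in Excel.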
The Histogram tool generates data for the number of occurrences of a value in a data set.

The results are shown in Figure 8.1.

    Histogram (chart of Frequency against Bin)

    Bin      Frequency
    28       1
    40.8     4
    53.6     7
    66.4     9
    79.2     6
    More     2

9. Regression

The Regression analysis tool performs linear regression analysis by using the "least squares" method to fit a line through a set of observations. You can analyze how a single dependent variable is affected by the values of one or more independent variables. For example, you can analyze how an athlete's performance is affected by such factors as age, height, and weight. You can apportion shares in the performance measure to each of these three factors, based on a set of performance data, and then use the results to predict the performance of a new, untested athlete.

The results are shown in Figure 9.1.

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.986988827
    R Square             0.974146945
    Adjusted R Square    0.972231903
    Standard Error       1.466979035
    Observations         30

    ANOVA
                  df    SS          MS          F          Significance F
    Regression    2     2189.395    1094.698    508.682    3.71E-22
    Residual      27    58.10474    2.152027
    Total         29    2247.5

                    Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
    Intercept       -8.55842659     1.885442          -4.53922    0.000105    -12.427      -4.68982     -12.427        -4.68982
    X Variable 1    0.329390252     0.09318           3.534982    0.001493    0.1382       0.52058      0.1382         0.52058
    X Variable 2    0.16126437      0.094358          1.709061    0.098912    -0.03234     0.354872     -0.03234       0.354872

10. t-Test

The Two-Sample t-Test analysis tools test for equality of the population means that underlie each sample. The three tools employ different assumptions: that the population variances are equal, that the population variances are not equal, and that the two samples represent before-treatment and after-treatment observations on the same subjects.
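The three assumptions lead to three different test statistics, which the following Python sketch computes for the two samples of Tables 2.1 and 2.2. It is an illustration only: the row-by-row pairing used for the paired case is an assumption about the worksheet layout, the function names are hypothetical, and p-values are left out because they need the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

# Months of working experience from Tables 2.1 and 2.2
# (paired row by row for the paired test; this pairing is an assumption).
business = [23, 28, 29, 34, 34, 39, 43, 44, 45, 45, 48, 48, 49, 54, 54,
            54, 55, 56, 56, 65, 65, 65, 67, 73, 76, 76, 77, 78, 87, 92]
accounting = [10, 12, 12, 16, 19, 20, 22, 23, 23, 23, 26, 28, 29, 32, 33,
              34, 34, 41, 43, 43, 44, 45, 45, 54, 56, 56, 56, 65, 67, 76]


def paired_t(xs, ys):
    """Paired samples: mean of the differences divided by its standard error."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / sqrt(variance(diffs) / len(diffs))


def equal_variance_t(xs, ys):
    """Independent samples, equal variances: uses the pooled sample variance."""
    nx, ny = len(xs), len(ys)
    pooled = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / sqrt(pooled * (1 / nx + 1 / ny))


def unequal_variance_t(xs, ys):
    """Independent samples, unequal variances: each variance keeps its own weight."""
    return (mean(xs) - mean(ys)) / sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))


print("paired           :", round(paired_t(business, accounting), 4))
print("equal variances  :", round(equal_variance_t(business, accounting), 4))
print("unequal variances:", round(unequal_variance_t(business, accounting), 4))
```

With two groups, the square of the equal-variance t statistic equals the F value of the single-factor Anova, which gives a quick consistency check between the two tools.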
The results are shown in Figure 10.1.

    t-Test: Paired Two Sample for Means

                                    Variable 1     Variable 2
    Mean                            55.3           36.23333333
    Variance                        318.0793103    310.1850575
    Observations                    30             30
    Pearson Correlation             0.98647353
    Hypothesized Mean Difference    0
    df                              29
    t Stat                          35.7211422
    P(T<=t) one-tail                7.92407E-26
    t Critical one-tail             1.699127027
    P(T<=t) two-tail                1.58481E-25
    t Critical two-tail             2.045229642

11. z-Test

The z-Test: Two Sample for Means analysis tool performs a two-sample z-Test for means with known variances. This tool is used to test the null hypothesis that there is no difference between two population means against either one-sided or two-sided alternative hypotheses. If variances are not known, the worksheet function Z.TEST should be used instead.

The results are shown in Figure 11.1.

Pivot Tables

Pivot tables are one of Excel's most powerful features. A pivot table allows you to extract the significance from a large, detailed data set.

Our data set consists of 214 rows and 6 fields: Order ID, Product, Category, Amount, Date and Country.

Insert a Pivot Table

To insert a pivot table, execute the following steps.

1. Click any single cell inside the data set.
2. On the Insert tab, click PivotTable.

Drag fields

The PivotTable field list appears. To get the total amount exported of each product, drag the following fields to the different areas.

1. Product Field to the Row Labels area.
2. Amount Field to the Values area.
3. Country Field to the Report Filter area.

Below you can find the pivot table. Bananas are our main export product. That's how easy pivot tables can be!

Sort

To get Banana at the top of the list, sort the pivot table.

1. Click any cell inside the Total column.
2. The PivotTable Tools contextual tab activates. On the Options tab, click the Sort Largest to Smallest button (ZA).

Result.

Filter

Because we added the Country field to the Report Filter area, we can filter this pivot table by Country. For example, which products do we export the most to France?

1. Click the filter drop-down and select France.

Result. Apples are our main export product to France.

Note: you can use the standard filter (triangle next to Product) to only show the totals of specific products.

Change Summary Calculation

By default, Excel summarizes your data by either summing or counting the items. To change the type of calculation that you want to use, execute the following steps.

1. Click any cell inside the Total column.
2. Right click and click on Value Field Settings...
3. Choose the type of calculation you want to use. For example, click Count.
4. Click OK.

Result. 16 out of the 28 orders to France were 'Apple' orders.

Two-dimensional Pivot Table

If you drag a field to the Row Labels area and Column Labels area, you can create a two-dimensional pivot table. For example, to get the total amount exported to each country, of each product, drag the following fields to the different areas.

1. Country Field to the Row Labels area.
2. Product Field to the Column Labels area.
3. Amount Field to the Values area.
4. Category Field to the Report Filter area.

Below you can find the two-dimensional pivot table.

To easily compare these numbers, create a pivot chart and apply a filter. Maybe this is one step too far for you at this stage, but it shows you one of the many other powerful pivot table features Excel has to offer.

Pivot Chart

A pivot chart is the visual representation of a pivot table in Excel. Pivot charts and pivot tables are connected with each other.

Below you can find a two-dimensional pivot table. Go back to Pivot Tables to learn how to create this pivot table.

Insert Pivot Chart

To insert a pivot chart, simply insert a chart.

1. Click any cell inside the pivot table.
2. On the Insert tab, click Column and select one of the subtypes. For example, Clustered Column.

Below you can find the pivot chart. This pivot chart will amaze and impress your boss.
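The two-dimensional pivot table that this chart is built on is, in essence, a grouped sum of Amount by Country and Product with an optional Category filter. The Python sketch below illustrates that idea. The order rows are hypothetical and are not taken from the 214-row data set, and the function name pivot_sum is illustrative.

```python
from collections import defaultdict

# Hypothetical order rows: (country, product, category, amount).
# These values are made up for illustration only.
orders = [
    ("France", "Apple", "Fruit", 120),
    ("France", "Banana", "Fruit", 80),
    ("France", "Apple", "Fruit", 60),
    ("United States", "Banana", "Fruit", 200),
    ("United States", "Carrots", "Vegetables", 50),
    ("France", "Carrots", "Vegetables", 30),
]


def pivot_sum(rows, category=None):
    """Sum Amount per (country, product); keep one category only if one is given."""
    totals = defaultdict(lambda: defaultdict(int))
    for country, product, row_category, amount in rows:
        if category is None or row_category == category:
            totals[country][product] += amount
    # Convert the nested defaultdicts to plain dicts for easy printing.
    return {country: dict(products) for country, products in totals.items()}


# Rows are countries, columns are products (no filter applied).
for country, products in pivot_sum(orders).items():
    print(country, products)

# The same table with the report filter set to one category.
print(pivot_sum(orders, category="Vegetables"))
```

Country plays the role of the Row Labels area, Product the Column Labels area, Amount the Values area, and the category argument the Report Filter area.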
Note: any changes you make to the pivot chart are immediately reflected in the pivot table and vice versa.

Filter Pivot Chart

To filter this pivot chart, execute the following steps.

1a. Use the standard filters (triangles next to Product and Country). For example, use the Country filter to only show the total amount of each product exported to the United States.

1b. Because we added the Category field to the Report Filter area, we can filter this pivot chart (and pivot table) by Category. For example, use the Category filter to only show the vegetables exported to each country.

Change Pivot Chart Type

You can change to a different type of pivot chart at any time.

1. Select the chart.
2. The PivotChart tools contextual tab activates. On the Design tab, click Change Chart Type.
3. Choose Pie.
4. Click OK.

Note: pie charts always use one data series (in this case, Apple). To get a pivot chart of a country, swap the data over the axis. Select the chart. The PivotChart tools contextual tab activates. On the Design tab, click Switch Row/Column.

SORT

You can sort your Excel data on one column or multiple columns. You can sort in ascending or descending order.

One Column

To sort on one column, execute the following steps.

1. Click any cell in the column you want to sort.
2. To sort in ascending order, on the Data tab, click AZ.

Result:

Note: to sort in descending order, click ZA.

Multiple Columns

To sort on multiple columns, execute the following steps.

1. On the Data tab, click Sort. The Sort dialog box appears.
2. Select Last Name from the 'Sort by' drop-down list.
3. Click on Add Level.
4. Select Sales from the 'Then by' drop-down list.
5. Click OK.

Result. Records are sorted by Last Name first and Sales second.

Conditional Formatting

Conditional formatting in Excel enables you to highlight cells with a certain color, depending on the cell's value.

Highlight Cells Rules

To highlight cells that are greater than a value, execute the following steps.

1. Select the range A1:A10.
2. On the Home tab, click Conditional Formatting, Highlight Cells Rules, Greater Than...
3. Enter the value 80 and select a formatting style.
4. Click OK.

Result. Excel highlights the cells that are greater than 80.

5. Change the value of cell A1 to 81.

Result. Excel changes the format of cell A1 automatically.

Note: you can also highlight cells that are less than a value, between a low and high value, etc.

Clear Rules

To clear a conditional formatting rule, execute the following steps.

1. Select the range A1:A10.
2. On the Home tab, click Conditional Formatting, Clear Rules, Clear Rules from Selected Cells.

Top/Bottom Rules

To highlight cells that are above the average of the cells, execute the following steps.

1. Select the range A1:A10.
2. On the Home tab, click Conditional Formatting, Top/Bottom Rules, Above Average...
3. Select a formatting style.
4. Click OK.

Result. Excel calculates the average (42.5) and formats the cells that are above this average.

Note: you can also highlight the top 10 items, the top 10 %, etc. The sky is the limit!

VLookup Function

Learn all about Excel's lookup & reference functions such as the VLOOKUP, HLOOKUP, MATCH, and CHOOSE functions.

VLookup

The VLOOKUP (Vertical lookup) function looks for a value in the leftmost column of a table, and then returns a value in the same row from another column you specify.

1. Insert the VLOOKUP function shown below.

Explanation: the VLOOKUP function looks for the ID (104) in the leftmost column of the range $E$4:$G$7 and returns the value in the same row from the third column (third argument is set to 3). The fourth argument is set to FALSE to return an exact match or a #N/A error if not found.

2. Drag the VLOOKUP function in cell B2 down to cell B11.

Note: when we drag the VLOOKUP function down, the absolute reference ($E$4:$G$7) stays the same, while the relative reference (A2) changes to A3, A4, A5, etc.
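The exact-match behaviour described above can be mimicked in a few lines of Python. This is a sketch of the idea, not of how Excel implements VLOOKUP. The lookup table is hypothetical (only the ID 104 and the four-row, three-column shape of $E$4:$G$7 come from the text), and the function name vlookup is illustrative.

```python
# Hypothetical lookup table standing in for the range $E$4:$G$7.
# The leftmost column holds the ID; the other two columns are made up.
lookup_table = [
    (101, "Keyboard", "Hardware"),
    (102, "Monitor", "Hardware"),
    (103, "Antivirus", "Software"),
    (104, "Spreadsheet", "Software"),
]

NOT_AVAILABLE = "#N/A"


def vlookup(lookup_value, table, col_index):
    """Exact-match lookup, like a fourth argument of FALSE.

    Scans the leftmost column from the top and returns the value in column
    col_index (counted from 1) of the first matching row, or '#N/A'.
    """
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return NOT_AVAILABLE


# Dragging the formula down: the table stays fixed (absolute reference)
# while the lookup value changes from row to row (relative reference).
ids = [104, 101, 999]
results = [vlookup(item_id, lookup_table, 3) for item_id in ids]
print(results)
```

The ID 999 is not in the table, so the sketch returns the '#N/A' marker for it, which corresponds to the #N/A error Excel shows for a failed exact match.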
HLookup

In a similar way, you can use the HLOOKUP (Horizontal lookup) function.

Match

The MATCH function returns the position of a value in a given range.

Note: Yellow is found at position 3 in the range E4:E7. The third argument is optional. Set this argument to 0 to return the position of the value that is exactly equal to lookup_value (A2) or a #N/A error if not found.

Note: 97 is found at position 3 in the range E4:E7.

Choose

The CHOOSE function returns a value from a list of values, based on a position number.

Note: Boat is found at position 3.

Data Validation

Use data validation in Excel to make sure that users enter certain values into a cell.

Data Validation Example

In this example, we restrict users to entering a whole number between 0 and 10.

Create Data Validation Rule

To create the data validation rule, execute the following steps.

1. Select cell C2.
2. On the Data tab, click Data Validation.

On the Settings tab:

3. In the Allow list, click Whole number.
4. In the Data list, click between.
5. Enter the Minimum and Maximum values.

Input Message

Input messages appear when the user selects the cell and tell the user what to enter.

On the Input Message tab:

1. Check 'Show input message when cell is selected'.
2. Enter a title.
3. Enter an input message.

Error Alert

If users ignore the input message and enter a number that is not valid, you can show them an error alert.

On the Error Alert tab:

1. Check 'Show error alert after invalid data is entered'.
2. Enter a title.
3. Enter an error message.
4. Click OK.

Data Validation Result

1. Select cell C2.
2. Try to enter a number higher than 10.

Result:

Note: to remove data validation from a cell, select the cell, on the Data tab, click Data Validation, and then click Clear All.

Logical Function

Learn how to use Excel's logical functions such as the IF, AND and OR functions.

If Function

The IF function checks whether a condition is met, and returns one value if TRUE and another value if FALSE.

1. Select cell C2 and enter the following function.

The IF function returns Correct because the value in cell A1 is higher than 10.

And Function

The AND function returns TRUE if all conditions are true and returns FALSE if any of the conditions are false.

1. Select cell D2 and enter the following formula.

The AND function returns FALSE because the value in cell B2 is not higher than 5. As a result the IF function returns Incorrect.

Or Function

The OR function returns TRUE if any of the conditions are TRUE and returns FALSE if all conditions are false.

1. Select cell E2 and enter the following formula.

The OR function returns TRUE because the value in cell A1 is higher than 10. As a result the IF function returns Correct.

General note: the AND and OR functions can check up to 255 conditions.
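The logic of the three functions can be sketched in Python as follows. The thresholds 10 and 5 and the results Correct and Incorrect come from the text above. The two cell values are hypothetical, chosen so that the first is higher than 10 and the second is not higher than 5, and the function name excel_if is illustrative.

```python
def excel_if(condition, value_if_true, value_if_false):
    """Return one value when the condition is true and another when it is false."""
    return value_if_true if condition else value_if_false


# Hypothetical cell values: the first is higher than 10,
# the second is not higher than 5.
first_value = 12
second_value = 3

# IF on a single condition.
if_result = excel_if(first_value > 10, "Correct", "Incorrect")

# AND: true only when every condition is true (Python's all()).
and_result = excel_if(all([first_value > 10, second_value > 5]),
                      "Correct", "Incorrect")

# OR: true when at least one condition is true (Python's any()).
or_result = excel_if(any([first_value > 10, second_value > 5]),
                     "Correct", "Incorrect")

print(if_result, and_result, or_result)
```

Nesting the AND or OR test inside the IF, as the worksheet formulas do, turns the TRUE or FALSE result into the text that is displayed in the cell.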