1 USING MICROSOFT® EXCEL FOR STATISTICAL ANALYSIS

1.1 INTRODUCTION TO EXCEL AS A STATISTICAL TOOL

Excel is the dominant spreadsheet package in use today. As part of the Microsoft® suite of programmes it integrates seamlessly with Word and PowerPoint, and it can also be used as a convenient tool for capturing and storing field data such as that gathered for the Samouel’s restaurant study. Standard cut and paste techniques can often be used to transfer data records into and out of Excel when more formal file importing procedures are unavailable.

In addition to the standard spreadsheet capabilities that are well known to most readers, Excel offers a wide range of statistical routines that are adequate for most standard survey applications. Developing an appreciation of this potential can save you considerable money, given the expense of comprehensive specialist statistical software, and time, given the learning curve attached to many of these specialist packages. This note provides a brief introduction to Excel for readers who have never used the package, before introducing some of the more useful statistics that can be accessed within Excel using its standard built-in routines and its add-in functionality.

1.2 EXCEL BASICS

Upon first loading Excel, you will see a screen like the one displayed in Exhibit 1. This screen provides all the functionality you will require to analyse data, present it in various graphical and text formats, and export the desired output to standard presentation or word processing packages such as PowerPoint and Word. Most of what appears in Exhibit 1 will be familiar to the regular computer user, including the menu bar structure, the scroll bars, and the minimise, maximise and close icons. Distinguishing characteristics of Excel are the cell structure, which is identified by the column and row labelling, and the sheet tabs displayed at the bottom of the screen.
Currently, the selected cell is B9, as identified by the bold box around its borders. Letters of the alphabet identify the columns and Arabic numerals identify the rows of the displayed sheet. Each sheet has a limiting capacity of 256 columns labelled A through IV and 65,536 rows labelled 1 through 65,536. This is far more than the average user will ever need.

Exhibit 1: Opening Screen for an Excel Workbook named Book1

Selecting any cell on the spreadsheet by moving the cursor and pressing the left mouse button will result in the selected cell being identified in the cell identification box and the contents of the cell being displayed in the cell contents box. Exhibit 1 shows these to be B9 and =sum(B3:B8) respectively. The result of this formula is obtained by adding the contents of cells B3 down to B8, and it is displayed in cell B9. This is the basic principle of spreadsheets. Each of the cells within the spreadsheet can contain numerical data, text data or formulae that calculate values based upon the contents of other referenced cells. In the illustration described above, cells B3 to B8 contain the numerical data 1, 2, 3, 4, 5 and 6 respectively, while cell B9 contains the formula needed to add these numbers together, giving 21. Cell B2 contains the text data “Samouel’s Restaurant”. Although the text appears to extend into cell C2, this is not the case; it merely seems so because the column display width is narrower than the text it contains.

Formatting areas of a spreadsheet is achieved by highlighting the desired elements of the spreadsheet and then using the Format menu contained in the menu bar or by using the icon options that are on display. This approach has been adopted to make the contents of cell B9 appear in bold and to place lines (borders) above and below the summed value of 21. Exhibit 2 displays the scope of formatting available.
Individual cells and ranges of cells can be formatted based upon desired numerical presentation, alignment, font style and colour, border, pattern or shading, and whether they should be protected against overwriting.

Exhibit 2: Options available once the Format>Cells dropdown menu has been selected

1.3 STANDARD BUILT-IN STATISTICAL OPTIONS

Excel contains approximately eighty easily accessible built-in statistical functions. These are accessed by moving the cursor to the cell where you wish to place the desired output and entering “=” into the cell. As soon as this is done, the Cell Identification Box described in Exhibit 1 will change its appearance to grey and display a function for selection (the displayed function may be one that you have used recently). Placing the cursor over the down arrow displayed immediately to the right of the function label and clicking the left mouse button causes a drop down box to appear, as displayed on the left hand side of Exhibit 3. Once the More Functions … option is selected by again highlighting it with the cursor and clicking the left mouse button, the next drop down box appears. It is displayed on the right hand side of Exhibit 3. Selecting the category Statistical lists all the standard statistical functions available for selection.

Exhibit 3: Displaying the standard built-in statistical functions for selection

Excel provides continued support to help you use the built-in function of your choice. This is best illustrated by way of an example. Assume that you wish to test for the equivalence of means between two data samples. Exhibit 4 displays the samples in columns B and C respectively. Each sample contains twenty data points, and the drop down menu is obtained by selecting the TTEST statistical function using the methodology outlined above. As can be seen in the exhibit, this function requires four inputs. Array1 contains the range of cells B2:B21.
It is created by either typing the range directly into the data box or by first clicking on the red arrow located at the right of the data box and then simply highlighting the cells containing the first sample data. Array2 identifies the second sample as being in cells C2:C21 in the identical manner. Tails takes either a 1 for a single tailed t-statistic or a 2 for a two tailed t-statistic. Finally, Type should contain 1 if a paired comparison test is required, 2 for a two-sample equal variance test or 3 for a two-sample unequal variance test. Although not shown in the exhibit, entering 2 in the Type data box will result in the answer 0.097120096 appearing at the lower equal sign. Selecting [OK] transfers this result to cell E1. As declared in the drop down box, the answer provided by this test is the probability associated with the Student’s t-Test.

A final point to note from the above illustration is that the Cell Contents Box contains the exact format required for the formula, namely:

=TTEST(B2:B21,C2:C21,2,2)

where the four arguments contained within brackets in the formula are for Array1, Array2, Tails and Type respectively.

Exhibit 4: Comparison of means using the TTEST function

Although listed alphabetically, the popularly used built-in statistical functions may be categorised as descriptive, inferential and distributional. These are presented and discussed in Exhibit 5.

Exhibit 5: List of commonly used built-in statistical functions

A. Descriptive Statistics

Function: =AVERAGE(Array1,Array2,Array3,…)
Purpose: Computes the arithmetic average of a range of numbers.
Argument(s): Arrays or values separated by commas. Examples include (B1:B35) to average the thirty-five numbers contained in the identified array or (B1:B10,12,C1:C20) to average the ten numbers in the first array, the number 12 and the twenty numbers in the second array.

Function: =CORREL(Array1,Array2)
Purpose: Computes the correlation between the two identified arrays.
Argument(s): The two arrays are separated by commas and are as described for the AVERAGE function.

Function: =COUNT(Array1,Array2,Array3,…)
Purpose: Computes the number of numerical values contained within the identified arrays.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

Function: =COVAR(Array1,Array2)
Purpose: Computes the covariance between the two identified arrays.
Argument(s): The two arrays are separated by commas and are as described for the AVERAGE function.

Function: =FREQUENCY(Data_Array,Bin_Array)
Purpose: Computes the frequency count of an array of numbers based upon a pre-specified bin range.
Argument(s): Data Array contains the numerical values that you want to develop a frequency count for and Bin Array contains the reference values that you wish to group the data into. An example of this would be (B1:B200,C1:C5) where cells C1 to C5 contain the numbers 0.2, 0.4, 0.6, 0.8 and 1.0. The solution is a vector one element longer than the Bin Array. It includes the number of values less than or equal to 0.2, the number greater than 0.2 and up to 0.4, the number greater than 0.4 and up to 0.6, and so on. Each value in the vector may be shown by using the =INDEX(FREQUENCY(Data_Array,Bin_Array),Index_Value) function. For example, if the Index Value from the above example is 6 then the cell containing the function will give the number of observations greater than 1.0 in the original Data Array.

Function: =KURT(Array1,Array2,Array3,…)
Purpose: Computes the kurtosis of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

Function: =MAX(Array1,Array2,Array3,…)
Purpose: Returns the largest value in a range of numbers.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

Function: =MEDIAN(Array1,Array2,Array3,…)
Purpose: Computes the median or middle number of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

Function: =MIN(Array1,Array2,Array3,…)
Purpose: Returns the smallest value in a range of numbers.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

Function: =MODE(Array1,Array2,Array3,…)
Purpose: Computes the mode or most frequently occurring number of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

Function: =PEARSON(Array1,Array2)
Purpose: Computes the Pearson product moment correlation between the two identified arrays.
Argument(s): The two arrays are separated by commas and are as described for the AVERAGE function.

Function: =PERCENTILE(Array,K)
Purpose: Computes the K-th percentile of a range of numbers.
Argument(s): Array contains the numeric values that you want to find the percentile of and K is a percentile number between 0 and 1.

Function: =SKEW(Array1,Array2,Array3,…)
Purpose: Computes the skewness or degree of asymmetry of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

Function: =STDEV(Array1,Array2,Array3,…)
Purpose: Computes the sample standard deviation of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

Function: =VAR(Array1,Array2,Array3,…)
Purpose: Computes the sample variance of a range of numbers.
Argument(s): Arrays or values separated by commas as described for the AVERAGE function.

B. Inferential Statistics

Function: =CHITEST(Actual,Expected)
Purpose: Computes the test for independence of two classifications using the Chi-square distribution and the appropriate degrees of freedom. The value produced is the probability, assuming the two classifications are independent, of obtaining a Chi-square statistic at least as large as the one observed.
Argument(s): Actual is the array containing the observed frequencies and Expected is the array containing the expected frequencies assuming the row and column classifications are independent. As an example, consider a classroom containing 24 male students and 18 female students, of whom 9 male students and 8 female students smoke cigarettes. This test can be used to assess whether smoking is independent of gender. The Actual data for this test may be set up in cells A1:B2 with A1 containing the number of male smokers (9), B1 containing the number of males who do not smoke (15), A2 containing the number of female smokers (8) and B2 containing the number of females who do not smoke (10). Each expected frequency is the row total multiplied by the column total and divided by the grand total of 42, which gives approximately 9.71 and 14.29 for males and 7.29 and 10.71 for females. If this Expected data is placed in cells D1:E2, then the function =CHITEST(A1:B2,D1:E2) will produce the result 0.650014. A probability this large means the data give no evidence against the independence of gender and smoking.

Function: =CONFIDENCE(Alpha,SD,Size)
Purpose: Computes the half-width of the confidence interval for a population mean.
Argument(s): Alpha is the significance level used to compute the confidence interval, SD is the standard deviation (the function treats it as a known population value, although the sample standard deviation is commonly supplied) and Size is the sample size. The value returned must be added to and subtracted from the sample mean to produce the desired confidence interval.

Function: =FTEST(Array1,Array2)
Purpose: Computes the result of an F-test that the variances in the two arrays are not significantly different using a one tailed test.
Argument(s): Array1 contains the first data sample and Array2 contains the second data sample.

Function: =LINEST(Ys,Xs,Const,Stat)
Purpose: Computes the results of a univariate or multi-variate linear regression.
Argument(s): Ys is a vector containing the dependent variable for the regression, Xs is a vector or array containing the independent variable(s) for the regression, Const is a logical value of True or False indicating whether a constant of the regression is allowed or whether the regression equation should be forced through the origin, and Stat is a logical value of True or False indicating whether additional statistics are required. As an example, consider =LINEST(A1:A20,B1:D20,TRUE,TRUE). Here the dependent variable consists of the twenty observations in cells A1 to A20. There are three independent variables contained in cells B1 to B20, C1 to C20 and D1 to D20 respectively. The function allows for a regression constant and additional statistics are produced. The full statistics produced by the function are: (1) the slope coefficients for each of the independent variables and the constant of the regression; (2) the standard errors of each slope coefficient and of the constant, used to test their respective significances; (3) the R-Square and standard error of the regression; (4) the F-statistic and the residual degrees of freedom; and (5) the sum of squares of the regression and sum of squares of the residuals.

As with the FREQUENCY function, the output of this function is not a single value but an array. As such, the =INDEX(LINEST(…),Row,Column) function must be used to access individual outputs. The =LINEST output array contains the coefficients in its first row, starting with the last independent variable in the first column and progressing to the regression constant. The corresponding standard errors are contained in the second row. The R-Square and the standard error of the regression are contained in the first and second column respectively of the third row. The F-statistic and the residual degrees of freedom are contained in the first and second column respectively of the fourth row.
Finally, the regression and residual sum of squares are contained in the first and second column respectively of the fifth row.

Function: =TTEST(Array1,Array2,Tails,Type)
Purpose: Computes the result of a Student’s t-test that the means in the two arrays are not significantly different.
Argument(s): Array1 contains the first data sample, Array2 contains the second data sample, Tails indicates whether a one tailed or two tailed test is to be conducted by containing the integer 1 or 2 respectively (or a cell reference to one of these numbers) and Type should contain the integer 1, 2 or 3 (or a cell reference to one of these numbers) to indicate a paired test, a test assuming equal variance across the two populations from which the samples have been drawn, or a test allowing for unequal variances, respectively. The function returns the probability associated with the test.

C. Distributional Statistics

Function: =CHIDIST(X,DoF)
Purpose: Computes the one tailed probability of the Chi-square distribution.
Argument(s): X is a value or cell reference to a value at which to evaluate the function and DoF is the degrees of freedom of the distribution or a cell reference to where the value is located.

Function: =CHIINV(Prob,DoF)
Purpose: Computes the inverse of the one tailed probability of the Chi-square distribution.
Argument(s): Prob is the probability associated with the Chi-square distribution and may be any number between 0 and 1 inclusive (or a cell reference to such a number) and DoF is the degrees of freedom of the distribution or a cell reference to where the value is located.

Function: =FDIST(X,DoF1,DoF2)
Purpose: Computes the F probability distribution for two datasets.
Argument(s): X is a value or cell reference to a value at which to evaluate the function and must be a non-negative number, DoF1 is the numerator degrees of freedom and DoF2 is the denominator degrees of freedom.
Both degrees of freedom must be 1 or greater.

Function: =FINV(Prob,DoF1,DoF2)
Purpose: Computes the inverse of the F probability distribution.
Argument(s): Prob is the cumulative probability associated with the F distribution and may be any number between 0 and 1 inclusive (or a cell reference to such a number), DoF1 is the numerator degrees of freedom and DoF2 is the denominator degrees of freedom. As for FDIST, both degrees of freedom must be greater than or equal to 1.

Function: =NORMSDIST(Z)
Purpose: Computes the standard normal cumulative distribution.
Argument(s): Z is a value or cell reference to a value at which to evaluate the function.

Function: =NORMSINV(Prob)
Purpose: Computes the inverse of the standard normal cumulative distribution.
Argument(s): Prob is the probability associated with the normal distribution and may be any number between 0 and 1 inclusive (or a cell reference to such a number).

Function: =TDIST(X,DoF,Tails)
Purpose: Computes the Student’s t-distribution.
Argument(s): X is a value or cell reference to a value at which to evaluate the function, DoF is the degrees of freedom of the distribution or a cell reference to where the integer value is located and Tails is a 1 or a 2 indicating whether the one tailed probability or two tailed probability value is required.

Function: =TINV(Prob,DoF)
Purpose: Computes the inverse of the Student’s t-distribution.
Argument(s): Prob is the probability associated with the two tailed Student’s t-distribution and may be any number between 0 and 1 inclusive (or a cell reference to such a number) and DoF is the degrees of freedom of the distribution or a cell reference to where the value is located.

1.4 ADD-IN STATISTICAL OPTIONS

In addition to the standard built-in functions described above, Excel also offers the user the opportunity to add further statistical functionality through the Analysis ToolPak. This functionality is added through the Tools menu bar at the top of the Excel screen.
Selecting Tools and then Add-Ins… results in a list of tick boxes for the available add-in options being displayed. This box is displayed as Exhibit 6. Selecting the Analysis ToolPak and OK installs the statistical functionality described below.

Exhibit 6: List of add-in functionality available in Excel

Once the Analysis ToolPak has been installed, the Tools menu bar includes a Data Analysis option that offers the range of statistical options displayed in Exhibit 7. Each of these techniques offers a comprehensive range of data identification and method selection options that are described within the technique load function using a form of Wizard® identical to that observed for the built-in statistical functions. The approach is illustrated below for two of the more popular techniques. The main difference between the data analysis routines described here and the built-in functions is that these produce multiple cell output that can be displayed on a different worksheet if required. If this is not required, then you need to make sure that your existing sheet has sufficient free cells to the right of and below the selected output cell to avoid overwriting existing spreadsheet data.

Exhibit 7: Add-in Data Analysis Options available through the ToolPak

1. Anova: Single-Factor
2. Anova: Two-Factor with Replication
3. Anova: Two-Factor without Replication
4. Correlation
5. Covariance
6. Descriptive Statistics
7. Exponential Smoothing
8. F-Test: Two-Sample for Variance
9. Fourier Series
10. Histogram
11. Moving Average
12. Random Number Generation
13. Rank and Percentile
14. Regression
15. Sampling
16. t-Test: Paired Two-Sample for Means
17. t-Test: Two-Sample assuming Equal Variances
18. t-Test: Two-Sample assuming Unequal Variances
19. z-Test: Two-Sample for Means

The approach and outcome of using the Descriptive Statistics procedure are displayed in Exhibit 8 and Exhibit 9 respectively.
As illustrated in Exhibit 8, the input panel (Wizard) that is opened when Tools → Data Analysis → Descriptive Statistics is selected allows you to select an input range, such as A1:B21 in this case, and indicate that the variables are stored in column format. Furthermore, ticking the available box allows the procedure to recognise that variable names are included in the first row of the selected array. Finally, the displayed procedure shows that the output is required to be placed in cell D1 (or at least that this cell will be the upper left most cell of the output range) and that summary statistics and the 95% confidence level for the means are required. Exhibit 9 displays the outcome once OK is selected. For each of the two variables, the procedure gives the mean, standard error of the mean, median, mode, standard deviation, sample variance, kurtosis, skewness, range, minimum, maximum, sum, count and 95% confidence level.

Exhibit 8: Illustration of the Descriptive Statistics procedure capture screen

Exhibit 9: Illustration of the output from the Descriptive Statistics procedure

As a final illustration of the statistical procedures available through the add-in Analysis ToolPak, Exhibits 10 and 11 present the approach and outcome of the Regression procedure. As illustrated in Exhibit 10, the input panel (Wizard) that is opened when Tools → Data Analysis → Regression is selected allows you to select a range for the dependent or Y-variable, such as A1:A21 in this case, and a range for the independent or X-variable or variables, such as B1:D21 in this case. Furthermore, ticking the Labels box allows the procedure to recognise that variable names are included in the first row of the selected arrays. Finally, the displayed procedure shows that the output is required to be placed in cell F1 (or at least that this cell will be the upper left most cell of the output range).
Although not selected for this illustration, the Regression procedure allows you to request various graphical plots that may be of interest.

Exhibit 10: Illustration of the Regression procedure capture screen

Exhibit 11 displays the outcome once OK is selected. The regression statistics include the multiple-R, the R-square, the adjusted R-square, the standard error of the regression and the number of observations. Additionally the regression ANOVA table is produced and it provides the regression F-statistic and its associated probability. The last table provided as output gives the estimated intercept and slope coefficients for each independent variable as well as their associated standard errors, t-statistics and probabilities. Finally, the 95% confidence limits for the intercept and coefficients are presented (together with an additional confidence limit if this has been selected as part of the regression requirement).

Exhibit 11: Illustration of the output from the Regression procedure

Microsoft product screen shots reprinted with permission from Microsoft Corporation.
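
Readers who wish to check the headline figures of the Regression output by hand may find a short calculation outside Excel helpful. The following is a minimal Python sketch for the single-predictor case only, not a reproduction of Excel's own routine. The function name simple_regression and the six data points are hypothetical and are not taken from the restaurant study or from Exhibit 11. It computes the multiple-R, R-square, adjusted R-square, standard error of the regression and F-statistic, together with the intercept, the slope, and the slope's standard error and t-statistic.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Split total variation into regression and residual parts (ANOVA table).
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = ss_total - ss_resid

    k = 1                 # number of independent variables
    df_resid = n - k - 1  # residual degrees of freedom
    r_square = ss_reg / ss_total
    std_error = math.sqrt(ss_resid / df_resid)
    se_slope = std_error / math.sqrt(sxx)

    return {
        "observations": n,
        "intercept": intercept,
        "slope": slope,
        "multiple_r": math.sqrt(r_square),
        "r_square": r_square,
        "adjusted_r_square": 1 - (1 - r_square) * (n - 1) / df_resid,
        "standard_error": std_error,
        "f_statistic": (ss_reg / k) / (ss_resid / df_resid),
        "se_slope": se_slope,
        "t_slope": slope / se_slope,
    }


# Hypothetical illustration data (assumed for this sketch only).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

results = simple_regression(x, y)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

With a single independent variable, the square of the slope's t-statistic equals the regression F-statistic, which offers a quick consistency check on any output of this kind.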