Microsoft Excel Users Guide to accompany Statistics: Unlocking the Power of Data by Lock, Lock, Lock, Lock, and Lock

Excel Users Guide- 1 Statistics: Unlocking the Power of Data

0.) Getting Started

Microsoft Excel is a simple and very widely used spreadsheet application. It is especially helpful for storing data, visualizing data, and doing simple algebraic manipulations like addition and multiplication. Excel is not really designed to do complex statistical analysis, and therefore some of the methods described in the later chapters of the textbook are not addressed in this guide. The Analysis Toolpak add-in does allow for basic statistics in Excel, however. For instructions on how to install and load the Toolpak, see Section 0.3 (this is not needed for the material in Chapters 1 & 2 of the textbook).

This guide applies to Microsoft Excel 2010 for Windows. Other versions of Excel for Windows and Mac OS are similar, but be aware that the specific steps described in this guide may differ in other versions.

0.1: Entering Data

When you first open Excel, you will see a big empty spreadsheet. We would like to fill these empty cells with data! Typically we organize data in Excel just as we structure a data table: each column represents a variable and each row represents a case or unit. The variable labels are given in the first row of cells, and the values for each case are given in the rows below.

To enter information, just click on a cell and start typing. The arrow keys are also useful for navigating between the cells. You can enter anything you want in each cell: text for categorical variables, and numbers for quantitative variables.

As an example, let’s enter some of the student survey data from Table 1.1 of the textbook. The first row labels each variable, so cell A1 should be ID, B1 should be Gender, and so on. It helps to make these labels bold (by highlighting them and pressing CTRL+B) so that they stand out.
The values for each student are then entered in the cells below.

You can open another empty spreadsheet by clicking on the tabs at the bottom of the screen (by default these are labeled ‘Sheet1’, ‘Sheet2’, etc.). When you save your Excel Workbook as an ‘.xlsx’ file, all of your sheets are saved to a single file.

0.2: Manipulating Data and Basic Operations

In Excel it is very easy to manipulate data; all you need to do is click on a cell and change its value. But be careful! When entering or manipulating data yourself it is easy to make a mistake, and it is hard to identify mistakes and fix them later.

Most basic operations for quantitative data are done using the formula bar (marked fx) above the spreadsheet. To specify how to compute the value of a cell, click on the cell and enter a formula in the bar above. All formulas should be preceded by ‘=’. The basic operations are ‘+’ for addition, ‘-’ for subtraction, ‘*’ for multiplication, ‘/’ for division, and ‘^’ for an exponent. If we enter ‘=(2*3)^2+4’ and press Enter (↵), the value of the current cell will be 6^2+4, or 40.

We can also perform operations using other cells. So if we click on A3 and enter ‘=A1/A2’ then press ↵, the value shown in A3 will be whatever is in A1 divided by whatever is in A2. Then, if we change the value in either A1 or A2, the value in A3 changes as well. Of course, if either A1 or A2 does not have a numeric value, this will result in an error.

It is easy to use a formula to compute a group of cells, like an entire row or column. One way to do this is the click and drag feature. Let’s return to the student survey example from Section 0.1; we would like to define a new variable that is the number of hours spent exercising minus the number of hours spent watching television in a week: Exercise - TV. Exercise is given in column E and TV is given in column F, so we’ll let column G be the difference.
After labeling cell G1, we use the formula bar to give the difference for the first student: ‘=E2-F2’, in cell G2. Then, click on cell G2 and move your cursor to the bottom right corner of the cell. Click and hold here, then drag down to cover cells G3 to G11 and release. This should extend the formula so that G3=E3-F3, G4=E4-F4, and so on. This click and drag feature can be much quicker than separately entering the formula in each cell.

0.3: Loading the Analysis Toolpak

The Analysis Toolpak for Excel is necessary for some statistical methods. All versions of Excel 2010 should have this toolpak as an option, but you probably need to load it first. To load it, click File in the top left corner and then select Options. This will open a new window. Select Add-Ins, then click the Go button near the bottom of the window. This will open another window. Check the Analysis Toolpak box and click OK. When you select the Data tab you should now see a Data Analysis icon at the top right of the screen.

1.) Using Excel in Chapter 1

1.1 Random Assignment for Experiments

Excel can be used to randomly assign units to experimental groups. A simple way to do this is to use the RAND function. Entering ‘=RAND()’ in the formula bar will generate a random number between 0 and 1 in the current cell. We can generate a random number for each experimental unit, and then assign groups based on these random numbers.

For example, perhaps a college professor wants to randomize her class of 16 students into two equally sized groups. One group will receive Exam A, the other group will receive Exam B (this is similar to Example 1.27 in the textbook). She starts by listing the names of her students in the first column. The names are listed in alphabetical order. We can randomize the students to the two exam groups in three steps:

1.) Add a second column of random numbers, titled ‘Random’.
To do this, enter ‘=RAND()’ in the first cell under the label, then click and drag to cover the remaining 15 cells.

2.) Sort the names based on the random numbers. To do this, highlight the 16 student names and then select Sort under the Data tab above. Select Expand the Selection and click Sort. This allows you to specify which variable to sort by. Select Random next to Sort by and click OK [1]. The 16 names now appear in a completely random order.

3.) We will assign the first 8 students to Exam A and the next 8 to Exam B. Enter a third column titled Exam and enter A in the first 8 rows and B in the next 8 rows. This gives the random exam assignment for each student.

[1] Sorting will generate a new set of random numbers, and new numbers are generated any time you change a cell. This is fine, but if you find it annoying you can change the settings so that the random numbers are generated once and then stay fixed. To do this select File -> Options -> Formulas, then choose Manual under Workbook Calculation.

2.) Using Excel in Chapter 2

2.1 Counts and Proportions for Categorical Variables

The COUNTIF() function can be used to count the number of times something occurs in a range of cells. This is useful for calculating counts and proportions for categorical variables.

For example, say we have a list of 16 students, and the award that each student would prefer to win (Olympic medal, Academy award, or Nobel prize) is given in cells B2 to B17. If we click on an empty cell and enter ‘=COUNTIF(B2:B17,"Olympic")’ in the formula bar, the output will be the number of times ‘Olympic’ occurs in cells B2 to B17. Note that the colon is used to indicate a range of cells; we can either type the range into the formula bar or use the cursor to highlight the desired range. To get a proportion, divide the count by the number of units, for example: ‘=COUNTIF(B2:B17,"Olympic")/16’.
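As a cross-check outside Excel, the same count and proportion can be sketched in a few lines of Python. The award list below is made-up illustrative data (not the textbook's), and the names `awards`, `olympic_count`, and `olympic_prop` are hypothetical.

```python
# Hypothetical award preferences for 16 students (illustrative data only),
# playing the role of cells B2 to B17.
awards = ["Olympic", "Nobel", "Academy", "Olympic",
          "Olympic", "Nobel", "Olympic", "Academy",
          "Nobel", "Olympic", "Olympic", "Academy",
          "Nobel", "Olympic", "Nobel", "Olympic"]

# Analogue of =COUNTIF(B2:B17,"Olympic")
olympic_count = awards.count("Olympic")

# Analogue of =COUNTIF(B2:B17,"Olympic")/16
olympic_prop = olympic_count / len(awards)

print(olympic_count, olympic_prop)
```

One difference to keep in mind: COUNTIF ignores upper and lower case when matching text, while the Python comparison above is case-sensitive.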
2.2 Bar Charts and Pie Graphs

Simple graphs can be created by highlighting the data, clicking the Insert tab above and selecting the desired chart. Bar charts and pie charts require a name and count for each category.

Consider the award preference data in Section 2.1. We have already found the count for each category using the COUNTIF() function. Now, we simply highlight the three category names with their counts, and select Insert -> Pie Chart -> 2D Pie Chart above. This will automatically place a pie chart of the selected data over the spreadsheet, and this chart will also be saved when we save the workbook as an ‘.xlsx’ file. A bar chart can be created similarly by highlighting the same data and selecting Bar Chart above. We can modify the look, labels and features of the graph by using the Chart Tools toolbar above.

2.3 Histograms

The standard version of Excel does not have the capability to automatically make a histogram from quantitative data. However, we can make one using the Bar Chart option with just a little bit of work. We need to first define and label our bins, and then use the COUNTIF() function to count the number of units that fall within each bin.

As an example, let’s add a third column to the student survey from Sections 2.1 and 2.2, Exercise (the number of hours spent exercising per week). Then, let’s label our bins (of width 5) somewhere else on the spreadsheet: 0-5, 5-10, 10-15, and 15-20. In this simple example it’s possible to manually count the number of units that fall into each bin, but let’s do it automatically. For example, ‘=COUNTIF(C2:C17,"<=10")-COUNTIF(C2:C17,"<=5")’ counts the number of Exercise values that are greater than 5 and at most 10 (the ‘<=’ stands for “less than or equal to”).

Now, we can create our histogram. Highlight the bin labels and count data, and select the Column -> 2D Column chart option above. This generates a histogram of the data (just a column bar chart of the count for each bin).
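The bin-counting idea (subtracting one cumulative "less than or equal to" count from the next) can be sketched in Python as a way to check the counts typed into the spreadsheet. The exercise hours below are made-up illustrative values, and `count_at_most`, `edges`, and `bin_counts` are hypothetical names.

```python
# Hypothetical weekly exercise hours for 16 students (illustrative data only),
# playing the role of cells C2 to C17.
exercise = [2, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 14, 15, 16, 18, 20]


def count_at_most(values, cutoff):
    """Analogue of =COUNTIF(range,"<=cutoff")."""
    return sum(1 for v in values if v <= cutoff)


# Upper edges of the bins 0-5, 5-10, 10-15, and 15-20.
edges = [5, 10, 15, 20]

bin_counts = []
previous = 0  # number of values at or below the previous upper edge
for edge in edges:
    cumulative = count_at_most(exercise, edge)
    # Difference of two cumulative counts = count inside this bin.
    bin_counts.append(cumulative - previous)
    previous = cumulative

print(bin_counts)
```

With this convention a value that falls exactly on a boundary (such as 10) is counted in the lower bin, and the bin counts always add up to the number of students.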
We can improve the look of the chart by removing the space between bars. Right-click on one of the bars, select Format Data Series, and move the Gap Width slider all the way to No Gap. Then, we can label our vertical axis by selecting Chart Tools -> Layout -> Axis Titles -> Primary Vertical Axis -> Rotated Title above. We enter the axis label “Frequency” in the box provided next to the axis. We also enter a title, “Histogram of Exercise Times”, by selecting the Chart Tools -> Layout -> Chart Title -> Above Chart option.

2.4 Mean, Median, Standard Deviation, and Percentiles

The mean, median, and standard deviation of a quantitative variable can be computed by entering ‘=AVERAGE()’, ‘=MEDIAN()’, and ‘=STDEV()’ in the formula bar, with the range of cells inside the parentheses. For example, consider the exercise times from Section 2.3, in cells C2 to C17. We can compute the mean of these exercise times by clicking on an empty cell and then entering ‘=AVERAGE(C2:C17)’ in the formula bar. Entering ‘=MEDIAN(C2:C17)’ gives the median for the exercise times and ‘=STDEV(C2:C17)’ gives the standard deviation.

The ‘=PERCENTILE(range, x)’ function computes a percentile for a range of values. Here ‘x’ is a number between 0 and 1, so x=0.95 corresponds to the 95th percentile. For example, we can compute the first quartile (25th percentile) of the exercise values by entering ‘=PERCENTILE(C2:C17,0.25)’ in the formula bar.

The MIN() and MAX() functions are also useful. ‘=MAX(C2:C17)’ gives the maximum value among the exercise times, and ‘=MIN(C2:C17)’ gives the minimum.

2.5 Boxplots

Excel does not have the capability to generate boxplots automatically. They can be created manually, but this involves several steps (for more details, see the Microsoft support article at this link: http://support.microsoft.com/kb/155130). We suggest using other software (such as the StatKey applets: http://lock5stat.com/statkey/) to create boxplots.
However, Excel can be used to find the five-number summary (as in Section 2.4), and then it is straightforward to create a boxplot by hand.

2.6 Correlation, Scatterplots, and Linear Regression

The function ‘=CORREL(range1,range2)’ can be used to calculate correlations. For example, let’s add another quantitative variable to the student survey data from Section 2.3, the number of hours spent watching TV. We wish to find the correlation between Exercise (in column C) and TV (in column D). We choose an empty cell and enter ‘=CORREL(C2:C17,D2:D17)’, which computes the correlation between the two variables.

The Scatter option can be used to make a scatterplot of two quantitative variables. To illustrate, we highlight the data for Exercise and TV and choose Insert -> Scatter -> Scatter with only Markers. We use the Chart Tools options above to add axis labels and a title.

After creating our scatterplot, we can select the Chart Tools -> Layout -> Trendline -> Linear Trendline option. This displays the least squares regression line on the chart. To display the equation for this line in the form y = ax + b, we select Trendline -> More Options and then check the Display Equation on Chart option.

3.) Using Excel for Chapters 3 and 4

Excel has no built-in capabilities to do the bootstrapping and randomization procedures that are introduced in Chapters 3 and 4. There are free add-ins available for download online (such as PopTools: http://www.poptools.org/) that allow for these capabilities. However, these are clunky and not very intuitive. We suggest using other software (such as the StatKey applets: http://lock5stat.com/statkey/) to perform the methods described in Chapters 3 and 4.

5.)
Using Excel for Theoretical Distributions (Chs 5-10)

5.1 Finding Normal probabilities

The NORMDIST function can be used to calculate probabilities for a normal distribution in Excel. Entering ‘=NORMDIST(x,mu,sigma,TRUE/FALSE)’ calculates a normal probability, where ‘x’ is a value and ‘mu’ and ‘sigma’ are the mean and standard deviation of the normal distribution. For our purposes the final argument of the function will always be ‘TRUE’, which calculates the area under the normal curve to the left of ‘x’ (set this value to ‘FALSE’ to compute the normal density, which is not relevant for the material in the textbook). The result will always be a probability between 0 and 1. For example, entering ‘=NORMDIST(1.96,0,1,TRUE)’ gives approximately 0.975. To find the upper tail probability for ‘x’, use ‘=1-NORMDIST(x,mu,sigma,TRUE)’. We can find the area between -1.96 and 1.96, for example, by entering ‘=NORMDIST(1.96,0,1,TRUE)-NORMDIST(-1.96,0,1,TRUE)’, which is about 0.95.

The NORMINV function takes an area under the normal curve and gives a value. Specifically, ‘=NORMINV(p,mu,sigma)’ gives the value ‘x’ such that the area under the normal curve to the left of ‘x’ is ‘p’. For example, entering ‘=NORMINV(0.975,0,1)’ gives approximately 1.96.

As an illustration of how we can use normal probabilities in Excel, consider the Gallup poll data in Example 6.7 of the textbook. In this example we test the hypothesis that the proportion of American adults who approve of the way Congress is handling its job is p=0.20. Under this hypothesis, the sample proportion from the Gallup poll of n=1013 individuals has distribution

\[ \hat{p} \sim N\left(0.20,\ \sqrt{\frac{0.20(1 - 0.20)}{1013}}\right). \]

In Excel, we enter the mean 0.20 in cell C1 and compute the SE in cell C2: ‘=(0.2*(1-0.2)/1013)^0.5’. The sample proportion \hat{p} is 0.19, so we can use the NORMDIST function to calculate a one-sided p-value: ‘=NORMDIST(0.19,C1,C2,TRUE)’.
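For readers who want to double-check these normal calculations outside Excel, the sketch below uses Python's standard-library `statistics.NormalDist` (Python 3.8 or later). This is only a cross-check under that assumption, not part of the Excel workflow; the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
standard = NormalDist(mu=0, sigma=1)

lower = standard.cdf(1.96)                          # like =NORMDIST(1.96,0,1,TRUE)
upper = 1 - standard.cdf(1.96)                      # upper tail beyond 1.96
middle = standard.cdf(1.96) - standard.cdf(-1.96)   # area between -1.96 and 1.96
cutoff = standard.inv_cdf(0.975)                    # like =NORMINV(0.975,0,1)

# Gallup poll example: null proportion 0.20, sample size 1013,
# observed sample proportion 0.19.
p0 = 0.20
n = 1013
se = sqrt(p0 * (1 - p0) / n)                        # like =(0.2*(1-0.2)/1013)^0.5
p_value = NormalDist(mu=p0, sigma=se).cdf(0.19)     # like =NORMDIST(0.19,C1,C2,TRUE)

print(round(lower, 3), round(middle, 3), round(cutoff, 2))
print(se, p_value)
```

The first printed line should agree with the values quoted above (0.975, 0.95, and 1.96); the second gives the standard error and the one-sided p-value for the Gallup example.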
5.2 Finding t-Distribution probabilities

The TDIST function is used to calculate probabilities from the t-distribution with specified degrees of freedom. Entering ‘=TDIST(x,df,tails)’ calculates a tail probability for the value ‘x’ using a standard t-distribution with ‘df’ degrees of freedom. If ‘tails’ is 1 then a one-sided probability is given (the area under the curve greater than ‘x’), and if ‘tails’ is 2 then a two-sided probability is given (the area greater than ‘x’ plus the area less than ‘-x’). TDIST requires ‘x’ to be nonnegative; for a negative t-statistic, enter its absolute value and use the symmetry of the t-distribution. For example, ‘=TDIST(1,10,2)’ gives approximately 0.34, whereas ‘=TDIST(1,10,1)’ returns approximately 0.17.

5.3 Finding Chi-Square probabilities

The CHIDIST function is used to find Chi-Square probabilities. Entering ‘=CHIDIST(x,df)’ returns the area greater than ‘x’ in a Chi-Square distribution with ‘df’ degrees of freedom. For example, ‘=CHIDIST(3,10)’ returns approximately 0.98.

5.4 Finding F-distribution probabilities

The FDIST function is used to find F-distribution probabilities. Entering ‘=FDIST(x,df1,df2)’ returns the area greater than ‘x’ in an F-distribution with ‘df1’ numerator degrees of freedom and ‘df2’ denominator degrees of freedom. For example, ‘=FDIST(2,5,10)’ returns approximately 0.16.

6.) Using Excel for Tests for Means (Chapter 6)

Excel can be used to perform some hypothesis tests that use a theoretical distribution automatically, but the options are somewhat limited. For example, you can use the Analysis Toolpak to do a test to compare two means (as described below), but there are no automatic procedures for intervals or tests for proportions. We give instructions for some of these tests.

Two-sample t-tests

The Analysis Toolpak add-in (see Section 0.3) makes it possible to run two-sample t-tests quickly. To do a two-sample t-test, select Data -> Data Analysis and then t-Test: Two-Sample Assuming Unequal Variances.
There are also similar choices in this window for a two-sample t-test assuming equal variances, and a t-test for matched-pairs data. After choosing a test, a window pops up that allows you to specify the data range for variable 1 and the data range for variable 2 (these are the two samples you are comparing). You can also specify the hypothesized difference (this will usually be 0) and a significance threshold. Click OK to run the test. This will open a new sheet (by default) with several statistics for the test, including summary statistics for each variable, a one-tailed p-value, and a two-tailed p-value.

7.) Using Excel for Chi-square Goodness-of-Fit (Ch. 7)

Chi-Square Test for Goodness-of-Fit

A chi-square goodness-of-fit test can be performed automatically using the CHITEST function. Entering ‘=CHITEST(obs_range,exp_range)’ computes a chi-square p-value for the observed counts given in cells ‘obs_range’ versus the expected counts given in cells ‘exp_range’. For example, if the observed counts are given in columns D to G of row 2, and the expected counts are given in columns D to G of row 3, ‘=CHITEST(D2:G2,D3:G3)’ will compute the goodness-of-fit p-value in one step. If you have the proportions under the null hypothesis in cells of the spreadsheet, you can use a formula to compute the expected counts by multiplying those cells by the sample size.

Chi-Square Test for Association for Two Categorical Variables

Excel has no built-in procedure that carries out the chi-square test for association for two categorical variables directly from the data.

8.) Using Excel for ANOVA for Means (Ch. 8)

ANOVA for Means

The Analysis Toolpak add-in (see Section 0.3) makes it possible to quickly run a one-way ANOVA for a difference in means. To do an ANOVA test, select Data -> Data Analysis and then ANOVA: Single Factor.
A window pops up that allows you to specify a range of data values. The groups for ANOVA are taken to be the different rows or columns in this range. For example, if there are 3 experimental groups and 6 values for each group, the data can be organized in a 6 x 3 block of cells, with one column per group. You can also select a significance threshold. Click OK to run the test. This will open a new sheet (by default) with several statistics for the test, including the between-group sum of squares, the within-group sum of squares, the F-statistic and the p-value.

9.) Regression (Chs 9-10)

9.1 Inference for a single predictor

The Analysis Toolpak add-in (see Section 0.3) makes it possible to automatically run a regression analysis. To do a regression analysis, select Data -> Data Analysis and then Regression. A window pops up that allows you to specify a range of X (predictor) values and a range of Y (response) values. After we click OK a new sheet will open (by default) with regression statistics. These include the R^2 value (line 5), the coefficient and p-value for the intercept (line 17), and the coefficient and p-value for the slope (line 18).

9.2 Inference for multiple predictors

The Regression option in the Analysis Toolpak can also be used to do inference for multiple predictors. We simply input multiple columns in the X Range field. After we click OK a new sheet will open with inference statistics for each of the predictors (lines 18, 19, etc.).
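To see where the single-predictor coefficients and R^2 in the regression output come from, here is a minimal Python sketch of the least squares formulas. It uses made-up data and hypothetical variable names, and it reproduces only the intercept, slope, and R^2, not the p-values that the Toolpak reports.

```python
# Hypothetical predictor (x) and response (y) values, illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares slope and intercept.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - sse / s_yy

print(intercept, slope, r_squared)
```

Entering the same two columns in Excel and running the Regression tool should give matching values for the intercept, the slope, and R^2, up to rounding.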