17:610:511 Research Methods EXCEL
Dan O'Connor & Soyeon Park 2003

Using Excel When Doing Descriptive Statistics

Excel can be useful when doing statistical analyses and graphs. Excel is widely available, whereas SPSS is usually found only in university settings. Thus, this handout encourages you to use Excel for course 511 whenever that is possible.

It is assumed that you know how to input data into a spreadsheet. It can be efficient to put the name of each variable in the first row of its column, above that variable’s scores. Save your data set under three different names, just in case a spreadsheet gets messy and you want to start over.

Descriptive Statistics

Use the Tools bar and press Data Analysis… [If Data Analysis is not on the Tools bar, then you will need to load it into Excel. You do this on the Tools bar by pressing Add-Ins and then punching the first item: Analysis ToolPak.]

For simple descriptive statistics, pull down the Tools bar, click Data Analysis, then click Descriptive Statistics and press OK. This brings up a dialog box with two sections: Input and Output Options.

Input: Fill in the location of the cells of the Input Range (e.g., a2:a21). Or, you can use the mouse to go to the spreadsheet and highlight cells A2 to A21, and your data’s cell locations will be entered automatically into the Input Range box. Your data will stay highlighted and flashing if you use this latter procedure. [Note: Excel will restate what you type in the Input Range box by adding some additional code; for example, a typical entry might be restated as: $A$2:$A$21.] Next, indicate that your data is: Grouped By: Columns. If appropriate, punch the box indicating: Labels In First Row.

Output options: Excel wants to know where you want the descriptive statistical results to be: on the same page as your data OR on a different page (i.e., a New Worksheet Ply, which is the default option).
You might want to list the variable’s mean, median, etc., on the same sheet as your data. If so, punch: Output Range, which then lights up a location box. You MUST click on that box to get the cursor there. In that box, give an arbitrary location for your descriptive statistics; for example, type in location f1:g15. [Note: Excel will restate this as: $F$1:$G$15. Optionally, you can also highlight the output location using the mouse.]

You MUST punch: Summary statistics. This will get you the mean, median, etc. If you forget this step, Excel will remind you.

After you click OK, Excel will print your statistics where you indicated. The output will be labeled Column1, and you will need to widen each of its two columns. Do this by moving the cursor to the top of the column, onto the border between the column letters; the cursor will change to a double arrow, which you can drag to widen the column.

Graphic Displays: Histogram

Pull down Tools and click on Data Analysis again. Click on Histogram in the dialog box. Fill in the Input Range (e.g. a2:a21) and the Output Range (e.g. a30:d60). Leave Bin Range alone for now. Be sure to punch Chart Output. Then click OK.

Excel will return to your worksheet; move down to the location you specified and your Bin and Frequency columns will be listed to the left of your histogram. The histogram will be compressed and appear as a bar chart.

Note the number of Bin categories. You may want to change this by typing different scale values to break your frequency distribution into six to eight groupings of data. Go back to your original sheet and type these new bin numbers into a blank range (e.g., h1:h8). Here is the reason for doing this: histograms should have between six and eight categories for small samples. If your number range goes from 1 through 14, then type in 2 4 6 8 10 12 14 in separate cells. (Note that this puts the histogram into seven categories.) Go back to the histogram dialog box and insert the cell locations for this new bin.
Also change the Output Range (e.g. e30:g60).

To eliminate the word “More” (optional): After your histogram is produced, highlight the two columns with the new bin numbers and frequency counts and then double click the histogram or do a right click. In the dialog box, select Source Data and change the data range so that it no longer includes the cell containing the word “More.”

If your range had been from 10 to 40 (about 30 on a scale), divide this by 7 to get about 4 numbers per bin. In this case, you might specify the following as bins: 10, 14, 18, 22, 26, 30, 34, 38. You may need to play around with the number of bins you will use. Try to get six to eight groupings of the data.

Your histogram will need at least two fixes: expand its size, and pull its bars together to show that it reflects continuous data (and is not a bar chart).

Expand graph’s size: If you click your histogram you will be able to drag a corner using a diagonal arrow (e.g. in the lower right corner, move the mouse until it turns into a diagonal double arrow). You can then drag this arrow down and to the right to get a larger graph.

Now, double click on one of the histogram’s bars. This should pull up a dialog box, Format Data Series, with six folders. [If you pull up Format Axis by mistake, just close it.] Go to the Options folder and change the Gap width from 150 to 0. You will see the bars come together. Close this box and return to the histogram.

You can change the titles and axis labels directly by retyping them. If you should pull up a three-folder box (usually with a right mouse click but sometimes with a double click), Format Chart Title or Format Axis Title, simply close it. When you click on text, your mouse pointer will eventually change to a straight line, at which point you can change the text. You may have to double click a vertical axis label to get it horizontal, change the label, and then double click it back to its vertical position. Do not spend too much time on the histogram.
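The binning step above can also be checked outside Excel. The following is a minimal Python sketch, not part of the original handout. It assumes that each bin number acts as an inclusive upper limit (a score equal to a bin number is counted in that bin) and that anything above the last bin falls under "More", which is how the Histogram tool is generally understood to count. The scores are made-up illustration data.

```python
from bisect import bisect_left


def histogram_counts(scores, bins):
    """Count scores per bin; bins must be sorted in ascending order.

    Assumption: each bin value is an inclusive upper limit, and scores
    above the last bin are collected in a final "More" slot.
    """
    counts = [0] * (len(bins) + 1)  # one extra slot for "More"
    for score in scores:
        # Index of the first bin whose upper limit is >= score
        counts[bisect_left(bins, score)] += 1
    return counts


# Hypothetical scores ranging from 1 through 14
scores = [1, 3, 3, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 14]
bins = [2, 4, 6, 8, 10, 12, 14]  # the seven bins suggested in the handout

counts = histogram_counts(scores, bins)
labels = [str(b) for b in bins] + ["More"]
for label, count in zip(labels, counts):
    print(f"{label:>4}  {count}")

# Bin width for a 10-to-40 range split into about seven groups
width = round((40 - 10) / 7)
bins_10_to_40 = list(range(10, 40, width))
print(bins_10_to_40)
```

With the 10-to-40 range, the last bin is 38, so any score of 39 or 40 would land in "More"; that is one reason to experiment with the bin values before settling on a chart.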
You can delete an extra histogram by clicking on the entire graph (it gets surrounded by lines with solid boxes in each corner and at each midpoint) and then pressing delete.

Bar charts:

Two types of bar charts suit our purposes: one showing the number of cases in each category of a categorical variable, the other showing a numerical variable’s sub-group means. Below is a brief explanation of how to do each type.

Simple bar graph using one categorical variable: Assume you have a gender variable showing that the sample has 12 men and 8 women in it. The simplest and fastest way to produce a bar graph with these data is to enter the numbers 12 and 8 into two adjacent cells and go directly to the Chart Wizard (the symbol for the Chart Wizard is a bar chart next to the globe on the standard tool bar). Open up the Chart Wizard and specify a Column Chart type (it should already be highlighted as the first on the chart list). For Chart sub-type, specify the first chart on that list (also already highlighted). Go to the Next step. Step 2 should already show your bar chart with 12 and 8. (If not, indicate the location of the summary data, and specify column data.) Step 3, Chart Options, allows you to type in the Chart title and the X and Y axis titles, and to specify gridlines, legend, data labels, and data table. You can ignore most of this or experiment with it, since you can see directly what each option does as you select it. Then, go to Step 4 OR go to Finish. At Finish, your chart will appear on your spreadsheet.

Using COUNTIF to count cells: Let us assume you had a very large sample and did not have a quick count of the number of men and women. You can instruct Excel to count these for you by using the COUNTIF function. First, click on a blank cell where Excel will put the results. Then click the Paste Function button, fx (on the main toolbar, next to the AZ sort buttons). In the All or Statistical categories, double click the COUNTIF function.
Fill in the dialog boxes. For the Criteria box, simply give any cell that contains the value you want to count. Then repeat this procedure, filling in a different Criteria for the other sub-group.

Bar Chart using numerical (continuous) variable broken into subgroups by categorical variable: Since Excel is not set up the way SPSS is, you will have to produce a separate table SORTED by your categorical variable. That way you can easily fill in the sequence of cells to produce descriptive statistics, histograms, and bar charts. SORT is a simple procedure. (Again, make a copy of your data; also, remember that after sorting the Edit menu has an Undo Sort option to undo anything you did not want permanent.)

To SORT: highlight a single cell in the data matrix (e.g. M, representing one score indicating a Man). Then, click the ascending sort key on the main tool bar (AZ with arrow pointing down). All contiguous columns will be sorted, keeping the integrity of each row intact. Note that an adjacent column may also be sorted, so always try to keep a blank column between your data matrix and output results.

Now you can treat each subgroup as you did the overall data, simply by identifying the location of each subgroup (e.g. for Input Range, specify a2:a11 to analyze Men; you might specify a12:a21 for Women). You can then use the results of the descriptive statistical analyses of each subgroup to create bar charts. Simply give the location of the two means as the two input cells needed to produce a column bar chart.

There are several other ways to accomplish this same thing. A second method to create a bar chart using subgroup averages is to press the Paste Function key and select AVERAGE. Identify the range of your subgroup’s continuous variable and report the average in a blank cell. Do this for the other sub-group average. Then use these two cells as input for a column bar chart.
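The subgroup counts and averages described above can be cross-checked with a few lines of Python. This is only a sketch for comparison and is not part of the original handout; the rows and the category codes "M" and "W" are made-up illustration data, and the two helper functions are hypothetical names standing in for what COUNTIF and AVERAGE do on a sorted sheet.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical data matrix: (gender code, score on the continuous variable)
rows = [("M", 10), ("W", 14), ("M", 12), ("W", 16), ("M", 14)]


def count_by_category(rows):
    """Number of cases per category (what COUNTIF gives, one criterion at a time)."""
    return Counter(category for category, _ in rows)


def mean_by_category(rows):
    """Average score per category (what AVERAGE gives for each sorted block)."""
    groups = defaultdict(list)
    for category, score in rows:
        groups[category].append(score)
    return {category: mean(values) for category, values in groups.items()}


counts = count_by_category(rows)
means = mean_by_category(rows)
print(dict(counts))  # heights of the bars in the simple bar graph
print(means)         # heights of the bars in the sub-group means chart
```

Unlike the spreadsheet approach, this grouping does not require the data to be sorted first.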
Thirdly, you can create cells with formulas in them to compute counts or averages or sums by pressing the equal sign next to the input window (below the tool bars but above the spreadsheet data). Then hit the down arrow at the left to see the functions available. This method takes some practice and you might want to experiment with it. Usually, the Paste Function key can do this and keep your output results clear. More experienced Excel users will create shortcuts using formulas embedded in cells.

Lastly, you can use the SORT button for small data sets, but you should know that recent versions of Excel have a clever device for larger data sets: the Pivot Table Wizard. Click on the Data menu and then click Pivot Table Report. Excel will then walk you through this procedure using three or four steps (similar to the Chart Wizard). You can experiment with pivot tables on your own.

Scattercharts:

Excel can easily create scattercharts in several different ways. For now, you can go to the Chart Wizard and select XY Scatter. Select the first option among the charts (i.e., not the ones with lines). In Step 2, fill in the range of TWO numeric variables (e.g., a2:b31). Indicate that this Series is in: Columns. In Step 3, you can modify the chart as you see fit. Then, go to Finish OR go to Step 4 and go to Finish. Later, we will create scattercharts as a byproduct of simple linear regression.

Inferential Statistics

t-tests

Computing a t-test in Excel is straightforward AFTER you have SORTed your data matrix. SORT is described above but its essentials will be repeated here. Go to your raw score data matrix in the spreadsheet. Ideally, you will have a blank column to the right of the matrix and a blank row below it (or else SORT will have an effect on that adjacent data). To do the sort, simply highlight one cell in your categorical variable and then press the AZ key on the toolbar; instantly, the entire matrix will be sorted by your categorical variable.
To do the t-test, go to Tools, then Data Analysis, then t-Test: Two-Sample Assuming Equal Variances. This brings up a dialog box requesting Variable 1 Range. Here, put in the cells for category A by entering the Dependent, continuous variable’s cell locations (e.g., b12:b21). Then, enter Variable 2 Range (put in the cells for category B by entering the Dependent, continuous variable’s cell locations, e.g., b22:b31). You do not need to fill much else into this dialog box unless you want to experiment with it. Press OK and the results will display the mean, variance, and other statistics. You will need to widen the columns to read the results.

The t-statistic will be reported (as t Stat). It will be a positive OR negative number (check your means to determine how Excel computed it). You can interpret this t Stat by comparing it to the last line in the results table: t Critical two-tail. This gives you the same t crit you can get from the table in the Williams text. Above this number in your results table is the probability of achieving this finding by chance alone. You might see: P(T<=t) two-tail 1.43669E-05. You can interpret this as: the probability of achieving this t-actual value by chance is .000014367, or less than one chance in 50,000. (The 50,000 was arrived at by rounding the probability up to .00002 and taking its reciprocal; thus, 1/.00002 = 50,000.)

Correlation & Linear Regression

For correlations, go to Tools, Data Analysis, and then Correlation. In the Input Range box, type the cells of the two continuous variables for which you want a correlation coefficient (e.g., b12:c31). Confirm that the data is Grouped By: Columns. Punch OK and you will get a simple table of results showing the correlation coefficient between your two variables; e.g., Excel might report that between Column 1 and Column 2 there is a correlation of 0.791715. You can interpret this r value using the table I gave you.
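The two statistics covered in this part of the handout, the equal-variance t statistic and the correlation coefficient, can be reproduced by hand as a check on the Excel output. The Python sketch below is not part of the original handout; it uses the standard textbook formulas (pooled variance for the t-test, Pearson's r for the correlation), and the function names and sample numbers are made up for illustration. It does not compute the probability or the critical value, which require the t distribution.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_1, group_2):
    """t statistic for two independent samples, assuming equal variances."""
    n1, n2 = len(group_1), len(group_2)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)
    standard_error = sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(group_1) - mean(group_2)) / standard_error


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(spread_x * spread_y)


# Hypothetical scores for two categories of a sorted data matrix
category_a = [2, 4, 6]
category_b = [1, 2, 3]
t_ab = pooled_t_statistic(category_a, category_b)
t_ba = pooled_t_statistic(category_b, category_a)
print(t_ab, t_ba)  # same size, opposite sign

# Hypothetical pair of continuous variables
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(r)
```

Swapping the two groups only flips the sign of the t statistic, which is why the handout says to check the means to see how the value was computed.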
Interpretation of a correlation needs to be done in concert with the scatterplot you constructed using these same data.

You can also get the correlation by doing a simple linear regression on the two variables. Go to Tools, Data Analysis, and then Regression. Enter the locations of each variable (e.g., at Input Y Range, enter b12:b31; at Input X Range, enter c12:c31). After OK, considerable output appears, and we will discuss what this means in class. One nice feature of Excel is that the scatterplot can add the regression line, regression formula and R² value to the chart (although you may have to reduce the type size of the formula and delete a box with Series… in it).

To request a correlation, go to Tools, Data Analysis, Correlation; you can identify as many variables as you want for the input box. Excel then reports a correlation matrix but does not indicate significance levels.

When constructing the scatterplot using Excel, have the Dependent Variable to the right of the Independent Variable. That way, the DV will plot on the proper Y axis and the IV will plot on the X axis.

To add a trend line to a scatterplot, click on the total scatterplot [so all four corners are highlighted] and then go to the upper tool bar and click on Chart. The pull down menu has Add Trendline, and you will select that. Then go to the Options folder of the dialog box and check two boxes: Display equation on chart and Display R-squared value on chart. An alternative is to request a regression from Tools, Data Analysis, and my example in the separate handout does that.
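The trend line equation and R² value that Excel places on the chart come from an ordinary least-squares fit with one predictor. The Python sketch below is not part of the original handout; it applies the standard least-squares formulas to made-up data, with x as the Independent Variable and y as the Dependent Variable, and the function name is hypothetical.

```python
from statistics import mean


def simple_linear_regression(x, y):
    """Least-squares slope, intercept, and R-squared for y predicted from x."""
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    slope = cross / spread_x
    intercept = mean_y - slope * mean_x
    # With a single predictor, R-squared equals the squared correlation
    r_squared = cross ** 2 / (spread_x * spread_y)
    return slope, intercept, r_squared


# Hypothetical data: IV in x (left column), DV in y (right column)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r_squared = simple_linear_regression(x, y)
print(f"y = {slope:.2f}x + {intercept:.2f}, R squared = {r_squared:.2f}")
```

Because R² here is the square of the correlation coefficient, taking its square root (and attaching the sign of the slope) gives the same r that the Correlation tool reports for the two variables.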