Chapter 3: Relationships Between Quantitative Variables

Scatter plot

Summaries of single variables such as pie charts, histograms, means, and five-number summaries are useful. But the real power of statistics comes from its methods for analyzing relationships between variables. As we did with single variables in chapter 2, we start with graphical summaries and then look at numerical summaries.

Look at Figure 3.1 in MOS. It is a scatter plot of hand span and height. Note which variable is plotted on the horizontal axis and which is on the vertical axis. When Excel is used to plot these data, here is what the result looks like:

[Figure: scatter plot "HandSpans and height", with Height (in.) on the horizontal axis and Handspan (cm.) on the vertical axis]

Now let's see how to create this scatter plot using Excel.
1. Start by opening the workbook handheight. You should find three variables, sex, height, and handspan, with 167 values in each column not counting the variable names in the first row.
2. Select all values in the height and handspan columns. That is, cells B1 through C168.
3. Click Insert, click on the Scatter option and select the first option.

The points on the scatter plot should be clustered tightly in the upper right corner of the graph. Let's change that. Notice that the points range from approximately 55 to 80 on the horizontal axis and approximately 16 to 26 on the vertical axis. We will make use of that information to adjust the scales on the two axes.
1. Double-click on one of the numbers that label the horizontal axis. This should open the Format axis window.
2. Click the Axis Options tab if it is not already in front.
3. To the right of the word Minimum, select Fixed and replace the number by highlighting the 0 and typing 55. To the right of Maximum, select Fixed and replace 100 with 80. To the right of Major unit, select Fixed and replace 20 with 5. Then click Close.
4. Double-click on one of the numbers that label the vertical axis.
Again, click the Axis Options tab if it is not already in front.
5. Next to Minimum, click Fixed and replace 0 with 14. Next to Maximum, click Fixed and replace 30 with 26. Next to Major unit, click Fixed and replace 5 with 2. Then click Close.

Your scatter plot should now look like the one shown above. You will probably note a few differences in format such as the color of the background and the absence of grid lines on the scatter plot above. I encourage you to experiment with right-clicking at various places on your graph and exploring the menu options that appear.

Standard Excel does not offer the option to indicate two groups on a scatter plot using a third variable as in Figure 3.4. However, there are statistical add-ins for Excel that do offer this option.

Trend Line, Regression, and Residuals

Now follow the same procedure to create a scatter plot of driver age and maximum legibility distance of highway signs as shown in Figure 3.7 in MOS (refer back to Example 3.2). The data are in the workbook signdist. As before, adjust the limits and major units of the scales so that they are roughly the same as shown in Figure 3.7. I used 15 to 85, with a major unit of 10, on the horizontal axis and 250 to 600, with a major unit of 50, on the vertical axis.

Now it's time to add a trend line to the scatter plot.
1. Move the cursor over one of the points on your scatter plot and right-click.
2. When the menu pops up, select Add Trendline.
3. When the Add Trendline window opens, select the Linear Trend/Regression type.
4. Click the box to the left of Display equation on chart.
5. Click Close.

You should now see a straight line superimposed on your scatter plot similar to the one below. You should also see the equation y = -3.0068x + 576.68, which is the regression equation corresponding to the trend line.

The regression equation can be used to predict maximum sign legibility distance based on a driver's age.
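As a quick check on the trend line, the prediction can be reproduced outside Excel. The following is a minimal Python sketch, not part of the Excel procedure. It uses the two coefficients from the equation displayed on the chart; the function name and the example driver ages are chosen here only for illustration.

```python
# Coefficients read from the trend line equation y = -3.0068x + 576.68.
SLOPE = -3.0068       # change in predicted distance per additional year of age
INTERCEPT = 576.68    # value of the line at age 0 (an extrapolation)


def predict_distance(age):
    """Predicted maximum legibility distance for a driver of the given age."""
    return SLOPE * age + INTERCEPT


# The ages below are illustrative values, not rows taken from signdist.
for age in (20, 50, 80):
    print(age, round(predict_distance(age), 1))
```

The negative slope means that each additional year of age lowers the predicted distance by about 3 units.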
However, sometimes we want to go beyond the regression equation and obtain additional information about the relationship between the two variables. For example, as described in Section 3.2, we may want to analyze the residuals. In order to obtain residuals we use Excel's Regression command, one of the options in Data Analysis (in the Data menu).
1. Click on the tab for the worksheet containing the age and distance data in signdist.
2. Select Data > Data Analysis > Regression. In the Regression window several boxes must be filled in.
a. In the box to the right of Input Y Range type the range of cells containing the distance data (for example, B1:B31).
b. In the box to the right of Input X Range type the cells containing the age data.
c. Click the box next to Labels.
d. In the box to the right of New Worksheet Ply type a name for the new worksheet Excel will create. (I wasn't very creative; I just named it regression.)
e. Click the box next to Residuals and the box next to Residual Plots.
f. Finally, click OK, and you should get a new worksheet showing regression results.

On the new worksheet, scroll down until you see RESIDUAL OUTPUT. Below that heading you will see the predicted distance and corresponding residual for each of the 30 points on the scatter plot. A partial listing is included below. On the upper right part of the worksheet you should see a scatter plot, like the one shown below, with age on the horizontal axis and residuals on the vertical axis. You can ignore the rest of the regression output for now.

Correlation

It is quite simple to determine a correlation coefficient with Excel. Let's do that for the age and distance data, from Example 3.2, that you have just been working with.
1. Click on the tab of the worksheet containing the age and distance data, and click on an empty cell where you want Excel to put the correlation coefficient. (I clicked cell E3.)
2. Click the Paste Function icon (fx).
3.
Under Function category select Statistical, and then, under Function name, select CORREL.
4. Click OK.
5. In the Array 1 box type the range of data for either variable, age or distance. Then in the Array 2 box type the range of data for the other variable. Notice that, for correlation, it does not matter which variable you specify as array 1. Click OK.
6. You should now see the correlation between driver age and maximum legibility distance of highway signs as shown below. I entered the word correlation in the cell above the correlation coefficient since Excel does not automatically add a label.

Correlation
-0.801244651

Regression Output

Finally, we will look at some of the elements of the regression output generated by Excel. Open the workbook pennstate1. Letting RtSpan be the explanatory variable and LftSpan be the response variable, use Excel's regression command (Data > Data Analysis) to perform a regression analysis. For each of the values below highlighted in yellow, confirm that you got the same result and compare it with the value shown on page 88 of the text (continued in MOS).

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.950
R Square             0.902
Adjusted R Square    0.902
Standard Error       0.639
Observations         190

ANOVA
              df     SS        MS        F          Significance F
Regression    1      708.15    708.15    1736.38    0.000
Residual      188    76.67     0.41
Total         189    784.82

              Coefficients    Standard Error    t Stat    P-value
Intercept     1.4635          0.4792            3.05      0.003
RtSpan        0.9383          0.0225            41.67     0.000

Chapter 4: Relationships between Categorical Variables

In chapter 3 we saw how Excel can help us analyze relationships between quantitative variables. In this chapter we address the same questions: (1) are the variables related and, if so, (2) what is the nature of the relationship? But this time we will consider categorical variables. We will use the pennstate1.xls data file. I recommend you also review the section headed Summarizing Categorical Variables with the Pivot Table in chapter 2 of this manual.
Please use chapter 2 of this manual to create a pivot table using Form as the row variable and SQPick as the column variable. While you are doing that, I'm going to get another cup of coffee. In case you are having trouble finding the pivot table, here is what it looks like.

You can easily convert these observed counts into row percents.
1. Move the cursor anywhere over the pivot table and right-click.
2. From the menu that drops down, select Show Values As.
3. Select % of row total.
The new table should look like this:

You can obtain a table of column percentages by repeating the procedure above. However, this time, next to Show Values As, select % of column. The new table looks like this.

Finally, to get back to the original table of counts, repeat the above procedure and, next to Show Values As, select No Calculation.

Expected Counts

If you examine Figure 6.4 in MOS, you will notice that it includes expected counts. Recall that these are the counts we would expect if there were no relationship between the two variables: order of letters on the questionnaire and letter picked by the student. Let's see how we can use Excel to compute these expected counts.
1. If your table contains percentages, restore them to cell counts. Repeat the three-step process above and, in the final step, under Show Values As, select No Calculation.
2. We want to leave the original table where it is, but create a copy and place it to the right of the original. Starting in the lower right cell of the table, click and drag to the upper left cell. The entire table should now be highlighted.
3. Use Copy/Paste Special, pasting values only, to place a copy of the table with its upper left cell in cell F2. (If your original table does not have its upper left cell in cell A2, use Copy/Paste to move it there.) Note that the following cell references will not work for you unless you have the upper left corners of your tables in cells A2 and F2, respectively.
4. In cell G5 type =$D5*B$7/$D$7.
Press Enter.
5. Select cell G5, select Copy, select cells G5 through H6 (the four cells that contain the observed counts), and select Paste.

Voila! Your copy of the original table should now contain the expected counts. What about that formula that you entered into cell G5? Where did that come from? Each expected count is the row total times the column total divided by the grand total. Recall that a $ to the left of a row or column reference makes that reference "fixed" instead of "relative." For example, the reference $D5 has a fixed column reference but a relative row reference. When the formula is copied into cell G6, the row reference will change to a 6, but the column reference will remain constant as D. I recommend you study the formula until you are sure you understand what happens when you copy it to other cells. The ability to combine fixed and relative references is one of the features that make Excel so versatile. I also labeled the two tables, in cells A1 and F1, for use in the next section.

The Chi-square Test

We now have the two main ingredients for a chi-square test to determine whether the two variables are related in the population from which the sample was drawn. Those ingredients are a table of observed counts and a table of expected counts. We will use the Excel function CHITEST to find the p-value for the test.
1. Click on the empty cell where you want the result of the chi-square test to be placed.
2. Click the Paste Function icon. Under Function category select Statistical, and then under Function name, select CHITEST.
3. Click OK.
4. In the Actual range box specify the range of cells that contain the observed counts, not including the totals. That is, type B5:C6.
5. In the Expected range box type the range of cells containing the expected counts, G5:H6.
6. Click OK.
You should now see the p-value:

p-value = 0.004689

Notice that I have added a title so that, if I look at my Excel worksheet in the future, I will remember what the number represents.
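The arithmetic behind the expected counts and the chi-square p-value can be sketched in Python. This is a hedged illustration, not a description of how Excel works internally. The observed counts below are made up (the pennstate1 counts are not reproduced here), the function names are hypothetical, and the p-value shortcut through math.erfc is valid only for 1 degree of freedom, which is what a table with two rows and two columns has.

```python
import math


def expected_counts(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def p_value_df1(chi_sq):
    """Right-tail probability of a chi-square value, 1 degree of freedom only."""
    return math.erfc(math.sqrt(chi_sq / 2))


# Hypothetical 2 x 2 table of observed counts (NOT the pennstate1 data).
observed = [[30, 20], [15, 35]]
expected = expected_counts(observed)
statistic = chi_square(observed, expected)
print(expected)
print(statistic, p_value_df1(statistic))
```

The expression r * c / grand_total plays the same role as the worksheet formula =$D5*B$7/$D$7: row total times column total, divided by the grand total.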
Finding Chi-Square

Excel's CHITEST function does not provide the chi-square value, only the p-value. However, if you want to know the value of chi-square, it is easy to find.
1. Click on the empty cell where you want the chi-square value placed.
2. Click the Paste Function icon. Under Function category select Statistical, then under Function name, select CHIINV, which stands for chi-square inverse. That is, CHIINV takes a p-value and "works back" to the chi-square value based on the chi-square probability distribution.
3. Click OK.
4. In the Probability box enter the reference to the cell that contains the p-value (0.004689). Alternatively, you can enter the p-value directly instead of the cell reference. CHIINV expects the right-tail probability, which is the p-value itself. (If you use the newer left-tail function CHISQ.INV instead, enter 1- followed by the p-value.)
5. In the Deg freedom box enter the number 1.
6. Click OK.
You should now see the chi-square value below, to which I added a label.

Chi-square = 7.995561

Finding the p-value

Finally, if you already have a chi-square value and want to find the corresponding p-value, here's how to do it with Excel. Let's use a chi-square value of 7.995.
1. Click on the empty cell where you want the p-value to be placed.
2. Click the Paste Function icon. Under Function category select Statistical, and then under Function name, select CHIDIST, which stands for chi-square distribution. That is, CHIDIST takes a chi-square value and finds the corresponding p-value based on the chi-square probability distribution.
3. In the X box type either 7.995 or a reference to a cell where you have already entered the chi-square value.
4. In the Deg freedom box type the number 1.
5. Click OK.
You should now see the p-value, approximately 0.00469. CHIDIST returns the right-tail probability, so no further calculation is needed. (The newer left-tail function CHISQ.DIST, with Cumulative set to TRUE, returns the left-tail probability, approximately .9953; in that case the p-value is 1 minus that result.)

Excel® Technology Manual to Accompany Mind on Statistics
© Cengage Learning. All rights reserved. No distribution allowed without express authorization.
FIFTH EDITION
Jessica M. Utts
University of California, Irvine
Irvine, CA
Robert F.
Heckard
Pennsylvania State University
State College, PA
Prepared by Melissa M. Sovak
California University of Pennsylvania, California, PA
ISBN-13: 978-1-285-83862-5
ISBN-10: 1-285-83862-9
© 2015 Cengage Learning. ALL RIGHTS RESERVED.
Excel® is a trademark of the Microsoft group of companies. Excel Technology Manual for Mind on Statistics 5e is an independent publication and is not affiliated with, nor has it been authorized, sponsored, or otherwise approved by Microsoft Corporation.

Contents
Chapter 1: Introduction
Chapter 2: Turning Data into Information
Chapter 3: Relationships between Quantitative Variables
Chapter 4: Relationships between Categorical Variables
Chapter 5: Sampling: Surveys and How to Ask Questions
Chapter 6: Gathering Useful Data for Examining Relationships
Chapter 7: Probability
Chapter 8: Random Variables
Chapter 9: Understanding Sampling Distributions: Statistics as Random Variables
Chapter 10: Estimating Proportions with Confidence
Chapter 11: Estimating Means with Confidence
Chapter 12: Testing Hypotheses about Proportions
Chapter 13: Testing Hypotheses about Means
Chapter 14: More about Regression
Chapter 15: More about Categorical Variables
Chapter 16: Analysis of Variance

Chapter 1: Introduction

Organization of this manual

This manual's goal is to help you learn to perform the computational parts of statistical analysis using Microsoft Excel. Each chapter is a companion to the corresponding chapter in your Mind On Statistics textbook. I have used the same chapter titles to avoid confusion. The manual uses examples from the text so that, after you have analyzed the data using Excel, you can always check your results against those in the text. When a method discussed in the text is not included in this manual, that means the method is not a feature of standard Excel and cannot be implemented using formulas presented in the text.

Excel

This manual is not a comprehensive guide to Excel. It focuses specifically on statistical analysis. Furthermore, it does not explain how to use a personal computer or how to work with the Microsoft Windows operating system, as this manual was written assuming that the user has had experience with this operating system.
The first step, of course, is to make sure that the computer you plan to use has Microsoft Excel installed. Excel is part of the Microsoft Office suite of programs. If you are not sure whether Excel is on your computer, the fastest way to find out is to click the Start button, scroll up to Programs, and look for Microsoft Excel in the list of programs that appear on the screen. If you find it, click on the title, and Excel will open. As it does, you will see a small window appear on the screen temporarily that indicates what version of Excel is on the computer. In this manual I have written all instructions based on Excel 2003. However, if you have an earlier version of Excel, you will find that most of the instructions I give will work for you as they are presented here. For Excel 2010 the first screen should look like this:

Before we explore Excel, I want to call your attention to a convention I just used because I will use it throughout the manual.
1. I will use red type for references to the textbook such as case study 1.1.
2. I will use green type to refer to variables such as HrsSleep and data files such as pennstate1.
3. I will use blue type when an action is called for, such as click or scroll, or when I am referring to an Excel menu item such as File or an Excel element such as the function Average.
4. I will use bold type for a reference to a cell such as A3, text you are to type in a cell such as =4*A3, and the contents of a cell after you have carried out an instruction.

If you are not familiar with all of the terms in these four statements, don't worry; we'll take care of that right now. You should now have Excel open on your computer screen. Across the top of the screen you will see a list of menu names: File, Home, Insert, Page Layout, etc. Most of these you are already used to seeing if you use Microsoft Word.
Under these you should see several options associated with the Home menu such as text options, alignment options, copy and paste options and more. The rest of the screen is either dark gray or is white with a grid of vertical and horizontal lines as shown above. If the screen is dark gray, use the mouse to move the cursor to the File menu and click on New, and a new workbook opens.

Near the bottom of the screen you should see tabs labeled Sheet1, Sheet2, and Sheet3. When you open a new workbook, it contains three worksheets. You can add or delete worksheets as needed, but more about that later. You should also see that grid of lines I mentioned before. There should be headings across the top of the grid: A, B, C, etc. These are the column headings. Down the left side of the screen there should be numbers: 1, 2, 3, etc. These are the row headings. Where a column and a row intersect is called a cell, and its row and column designators refer to the cell. Thus D3 is the cell reference for the cell in the fourth column and the third row. Notice that cell D3 is outlined with a dark line in the figure below.

The cell is the basic work unit within an Excel worksheet. Let’s have our first look at what you can do with a few cells in a worksheet.
1. In cell A1 type Temperature Conversion. Notice that not all of the text fit in cell A1, but Excel allowed it to spill over into cells B1 and C1.
2. In cells A3 and B3 type Temp F and Temp C, respectively.
3. In cell A4 type 68, and in cell B4 type =5/9*(A4-32). After you press the Enter key you should see the number 20 in cell B4. You have just converted a temperature in Fahrenheit, 68 degrees, into a temperature in Celsius, 20 degrees.
4. Now click on cell B4. Notice the dark box around the cell. This tells you that cell is currently active. Just below the formatting toolbar, you should see the Formula Bar containing what you typed in cell B4.
If you needed to edit what you typed in the cell, you would edit the contents of the Formula Bar.
5. With cell B4 active, move the cursor to the home toolbar and click on the icon B. The 20 in cell B4 should now be in boldface type. Click on the B again, and the boldface goes back to regular type.
6. To the right of the icons B, I, and U are four icons for aligning text. Click the icon to Center the contents of B4. Also click to Center the contents of cell A4.
7. Finally, change the title in cell A1 to boldface type.
This is what your Excel worksheet should now look like.

Let’s review what you just did. You typed three kinds of contents into worksheet cells: text in cells A1, A3, and B3, a numerical value in cell A4, and a formula in cell B4. I want to show you one more feature of Excel before we leave our temperature example.
8. Type the number 70 in cell A5. Center the number in this cell.
9. Click on cell B4. On the home menu, click the Copy icon to copy the formula.
10. Click on cell B5 to make it active, and then click on Paste. What do you see in cell B5? It should be the number 21.11111.
11. Let’s round this off to 21.1. Click on cell B5 to make it active. Move the cursor to the Decrease Decimal icon on the formatting menu. The icon looks like this: .00 >.0
12. Click on this icon. The number in cell B5 should now be 21.1111. Click the Decrease Decimal icon three more times, and cell B5 should now contain 21.1. However, it is important to realize that only the display of the number in the cell has been rounded. The number stored in the memory of the computer has not been changed.
Your worksheet should now look like this.

Let’s review. You now know how to:
1. Open a new workbook
2. Make a worksheet cell active
3. Enter content into a cell
4. Change the format of the content, and
5. Copy the content of a cell and paste it into another cell.
In later chapters you will discover how much time that copy/paste feature can save you. These are the basics.
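For readers who want to see the same logic outside a worksheet, here is a minimal Python sketch of the temperature example. It is only an illustration: the function name is made up, and a plain function stands in for the formula in column B. It also shows the point made in step 12, that formatting changes what is displayed and not what is stored.

```python
def fahrenheit_to_celsius(temp_f):
    """Same arithmetic as the worksheet formula =5/9*(A4-32)."""
    return 5 / 9 * (temp_f - 32)


stored = fahrenheit_to_celsius(70)   # full-precision value, like cell B5
displayed = f"{stored:.1f}"          # one decimal place, like Decrease Decimal

print(fahrenheit_to_celsius(68))     # the conversion done in cell B4
print(stored, displayed)
```

Calling the function with a new input is the counterpart of pasting the formula into another row: the arithmetic stays the same and only the input changes.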
In the chapters that follow we will explore many additional capabilities of Excel, but you will use these basic features every time you work with Excel. Now let’s proceed to chapter 2 and do some statistics.

A Note about Add-Ins

The capabilities of basic Excel for statistical analysis are somewhat limited. If you are using Excel for this purpose, you should be aware that there are several statistics add-ins available that extend Excel’s statistical capabilities significantly. Some are sold commercially, and some are bundled with statistics textbooks.

Chapter 2: Turning Data into Information

In chapter 2 of the textbook, you learned a variety of ways, both graphical and numerical, to summarize a set of numbers. In this chapter of the guide we will see how you can use Excel to help you with these tasks.

Types of Data

The first thing we need to discuss is types of data. This is very important since not all tools for summarizing data are appropriate for all types of data. Unfortunately, the computer cannot, in most situations, tell you which tools to use and which not to use. It will obediently use the wrong tool and give you results that are nonsense. So, you need to first identify what kind of data you have.

Here is a rule of thumb that can help. If the data are represented by text, such as "left," "male," "yes," or "strongly agree," then it is almost certainly categorical data. If numbers, such as 3 or 6.514, represent the data, then it is likely to be measurement data. Try this rule on the list of eight questions and the resulting data in Section 2.1 of the textbook. Don't be misled by the text that gives units of measure such as hours, inches, and mph. Those terms are not part of the data. How many of the questions did you find that result in categorical data? If you said questions 1 (sex: m or f) and 3 (letter: S or Q), you are correct. The other six all result in measurement data.

You do need to watch out for categorical data that have been coded numerically.
For example, if you have data representing responses to question 1, the sex of the respondent, in Section 2.1 of the textbook, but the data are all 0 and 1, don't be misled. Instead of "m" and "f" the data could also be "coded" using 0 for male and 1 for female. Sex is still a categorical variable because the responses all fit in one of two categories and do not represent quantitative information. That is, we would not count or measure to determine which category a person belongs to.

With this rule of thumb and caution in mind, let's get busy summarizing. I will point out, as we go along, a few ways that Excel tries to help you match the tool to the data.

Summarizing Categorical Variables with the Pivot Table

The first Excel tool that we will meet for summarizing data is also one of the most useful. It is called the pivot table and is found in Excel's Insert menu. I will introduce you to the pivot table by showing you how to use it to create a summary of a categorical variable similar to Tables 2.1 and 2.2 in the text.

First, you need to open Excel and then open the Excel data file YouthRisk03 (from the companion website) using the procedures described in chapter 1. While you do that, I'm going to get a cup of coffee.

You should now see an Excel worksheet with data. You should see names in the top row: Sex, Grades, etc. There are five columns and 3042 rows of data (not counting the row with the variable names). Your screen should look like the picture below. Note that only the row of variable names and the first 16 rows of data are shown. You need to scroll down to see the rest of the data.

Using the terms from Section 2.1 of the textbook, observe that there are five variables, one in each of the first five columns of the worksheet; 3042 observational units, in rows 2 through 3043; and that one observation, seatbelt use for observational unit 11, is the active cell, E12. The complete data set resides in the block of cells A2 to E3043.
It is important to remember that, since we always use row 1 for the variable names, the row number of the last row of data will be the number of observational units plus one. For this data set that is 3042 + 1 = 3043.

Here are the steps of the procedure for creating a pivot table:
1. Click on a cell anywhere in the data set.
2. Now move the cursor to the top of the screen and click on the word Insert. That will cause a new menu ribbon to appear.
3. Click on Pivot Table. This opens the Create Pivot Table window.
4. Check that the entire data set, including the labels in the first row, is in the selected range of cells. If it is not, you can change the row and column references in the Table/Range box to include the entire data set.
5. Under Choose where you want the Pivot Table report to be placed, select New worksheet. Click OK. You should now see the Pivot Table and Pivot Chart Wizard - Layout window shown below.
6. Find the Seatbelt button on the right and drag it to the Row section of the Layout template. Also drag the same Seatbelt button to the Data section of the Layout template.

You should now have a pivot table that looks like this. Notice that this table does not match the one in Section 2.3 of the textbook. The category names are in a different order and they have numbers in front of them. However, we can edit the pivot table to look like the one in the textbook.

Move the cursor over cell A5 and click to make that cell active as shown above. The contents of cell A5 should now be visible in the Formula Bar. Click on the Formula Bar and position the cursor to the left of the word “Never.” Press the backspace key twice (to delete “1_”). Repeat this process for cells A5 through A8. Now the category names should be the same as those in the textbook table, but they are still in a different order. Right-click, that is, click the right button on your mouse, on cell A9 and, as that cell becomes active, a menu should appear.
Move the cursor over the word Order and a sub-menu of options should appear. Click on Move to Beginning. Repeat the use of the Order options until your pivot table looks like this:

If you want to convert the counts to percentages:

1. Move the cursor anywhere over the pivot table and right-click.
2. From the menu that drops down, click Value Field Settings.
3. Select the Show Values As tab.
4. Select % of column total from the drop down menu.
5. Click OK.

Your table should now look like this:

This table summarizes one variable, the frequency of use of seatbelts. So it is a "one-way" pivot table. Now let's create a "two-way" pivot table to summarize the frequency of seatbelt use for men and women separately.

1. Again, click any cell in the data and then click on Insert. As before, click Pivot Table.
2. Check to see that all of the data cells, including variable names, are selected, and click OK.
3. This time you should drag the Sex button into the Row section of the template, the Seatbelt button into the Column section, and the Seatbelt button into the Data section. You should now see a button in the Data section that says Count of Seatbelt. (Actually, you could drag either the Seatbelt button or the Sex button into the Data section and get the same result.)

Does your table look like this?

This table contains the same information as is shown in Table 2.2 in the text. Based on the numbers in the table, would you say that the order in which the letters are presented influenced the choices made by the students who answered? It might be easier to answer the question after looking at percents instead of counts. Repeat the procedure used above to change Field Settings. This time under Show values as, select % of row. Does your table look like this?

Visual Summaries for Categorical Variables: Pie Charts and Bar Graphs

Tables of numbers like the tables you just created often come in handy in interpreting data.
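If you would like to cross-check a pivot table outside Excel, the counting it performs can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the responses below are made up for illustration (they are not the YouthRisk03 data), and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical (sex, seatbelt) responses for illustration only;
# these are NOT the YouthRisk03 data.
responses = [
    ("Female", "Always"), ("Male", "Never"), ("Female", "Always"),
    ("Male", "Always"), ("Female", "Sometimes"), ("Male", "Sometimes"),
    ("Male", "Never"), ("Female", "Always"),
]

# One-way table: count of each seatbelt category.
one_way = Counter(seatbelt for _, seatbelt in responses)

# The same table shown as a percent of the column total.
total = sum(one_way.values())
one_way_pct = {category: 100 * n / total for category, n in one_way.items()}

# Two-way table: counts keyed by (sex, seatbelt).
two_way = Counter(responses)

# Row percents: each count divided by the total for that sex.
row_totals = Counter(sex for sex, _ in responses)
row_pct = {
    (sex, category): 100 * n / row_totals[sex]
    for (sex, category), n in two_way.items()
}

print(one_way)
print(row_pct)
```

The one-way counts play the role of Count of Seatbelt, and the two dictionaries of percents correspond to the % of column total and % of row settings used above.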
However, a well-chosen graph can not only be easier to understand but also be visually more powerful. Let's look at a couple of graphs that Excel can create. Look at Figure 2.2 in MOS, your textbook. We are going to create a pie chart very similar to that, but in order to do so we must first create another pivot table. Go back to the data in pennstate1 and use the procedure described above for one-way pivot tables to create a pivot table for the column headed RandNumb. When you drag the RandNumb button into the Data section of the template you will probably see Sum of RandNumb. To change this from sum to count, double click on the Sum of RandNumb button. The Pivot Table Field window should open. Under Summarize by, click Count, click OK, and then proceed as before. Your table should look like the one below.

Next you will instruct Excel to use this table to create a pie chart.

1. Move the cursor over the table and click on any cell.
2. Now move the cursor to the top of the screen and click Insert.
3. Click Pie and select the first option.

Check your pie chart against the one shown below. Note that it is not exactly like Figure 2.2 in MOS, but the difference is not significant. Repeat steps 2 and 3 above, but, this time, select Column under Chart type instead of Pie and select the first option. Note that the result is essentially the same as the bar chart in Figure 2.2 in MOS. Does your bar chart look like the one below?

Before we finish our exploration of how to summarize categorical data, we will create one more graph: a bar chart for two categorical variables. You might want to reread Example 2.2 in MOS and look at Figure 2.4.

1. First, in a blank Excel worksheet, create a table like Table 2.3. Since you do not have the original data from the survey of 479 children, you should type in category names and percentages to create a table like the one in the text.
2. Do not include the percent sign (%) or counts shown in Table 2.3.
Instead enter each percent as a decimal fraction. For example, the cell in the row labeled Darkness and the column labeled No Myopia contains 90%. Type this into your table as 0.9.
3. Now press Enter and then click on the cell into which you just entered the 0.9.
4. In the tool bars at the top of the screen, find the % icon. Hold the cursor over it for a few seconds without clicking. The phrase Percent Styles should appear. Click on the icon. The contents of the cell should have changed from 0.9 to 90%.
5. Type the rest of the percents into the table as decimal values (0.09, 0.01, 0.66, etc.), highlight all of the numbers by clicking and dragging over them, and again click on the % icon. You do not need to include the Total column or row, as you will not include them in your graph.
6. Now, highlight all of the percentages in your table along with the column labels (No Myopia, etc.), but not the row labels.
7. Then click Insert. Select Column and select the first option under 3-D Column.

Your bar chart should look like the one below. Compare it with Figure 2.3 in MOS. Note that the colors in the chart below are the same as the ones in the text. This is not important except to demonstrate that you can change the formatting of a chart created by Excel.

Finding Information in Quantitative Data: The Five-Number Summary

We now move on to summarizing measurement or quantitative data. We will begin by creating the five-number summaries for Example 2.5 in MOS. To create these summaries you will use several of Excel's built-in statistics functions. But first a little data sorting is needed. It is sometimes the case that the way the data are arranged is not the way you need to have them arranged for the analysis you want to perform. So knowing how to rearrange the data can come in handy. In the pennstate1 workbook, the right hand span data are in what is called a "stacked" format.
That is because right hand spans of women and those of men are stacked up in the same column. The only way we can tell which is which is to have another column, in this case the variable called Sex, to tell us which numbers are hand spans of women and which are those of men. What you need to do now is “unstack” the RtSpan column so that you have hand spans for men and hand spans for women in two different columns.

1. Find a blank worksheet in the Excel workbook containing the hand span data. If there isn’t one, click on the tab without a name to create a new worksheet.
2. Next click on the tab of the worksheet containing the data and click the letter at the top of the column containing the variable Sex. This is probably column A. The entire column should now be highlighted.
3. Click the Home menu and select Copy.
4. Now click the tab of the blank worksheet, click on cell A1, click on the Home menu, and select Paste. You should now have the Sex variable in the leftmost column (column A) of the new worksheet.
5. Go back to the data worksheet, select the RtSpan column, and then copy/paste it into column B of the new worksheet. I’ll wait.
6. Now we’re ready to sort data. In the new worksheet, click on any cell in the data, then click on the Data menu and select Sort. The Sort window should now be open.
7. If the variable Sex is not selected in the Sort by box, click on the down arrow and select it.
8. Click Add Level.
9. In the Then by box, select RtSpan. Your Sort window should look like this.
10. Click OK. The data are now sorted.

Next you will rearrange the hand span data into two columns.

1. Scroll down to the last row containing the word Female in column A. This should be row 104.
2. Click on the cell containing the first male right hand span. Is this cell B105? It is in my worksheet.
3. Now scroll to the last row that contains data, row 191, hold down the shift key, and click on cell B191.
4. Go to the Home menu and click Cut.
5. Finally, scroll back to the top of the worksheet, click cell C2, and click Home and then Paste.

Whew! You’ve done it. The data are unstacked, and you are ready to compute five-number summaries. This may seem like a lot of work just to prepare the data for analysis, but once you have done it a few times, you will do it very quickly. There is one more small detail to see to. You need to define column headings (variable names) that reflect the new arrangement of the data. For example, I typed F Span in cell B1 and M Span in cell C1.

Now compute the median of the male hand spans.

1. Click on a blank cell to the right of the data. I chose cell E7. Type the word Median.
2. Now move two cells to the right and type =MEDIAN(.
3. Click on the first male hand span (cell C2), then hold down the shift key and click (that's called a shift-click) on the last male hand span (cell C88).
4. Type a right parenthesis, ), and press the enter key.

You should now see 22.5, the median of the male hand spans, in cell G7. You have just used one of Excel’s many built-in functions. Go back and highlight the cell so that you can review what you typed. Start with an equal sign, then type the name of the function, in this case median, and then parentheses containing any “arguments” required by the function. For the median function, the arguments required are the first and last cells that contain the data. By the way, I always type the function name in all capital letters to remind me that I am using an Excel function, but that is not required.

There is an alternative way of entering a function into a cell.

1. Click the cell immediately under the cell in which you typed Median.
2. Type the word Quartiles.
3. Now click the cell two cells to the right, G8.
4. Go to the top tool bar and click the Paste Function icon. It looks like this: fx.
5. The Paste Function window should now be open on your screen. Scroll down until you find Quartile, click on it, and then click OK.
6. In the Quartile window, click on the small red arrow on the right side of the Array box. The Quartile window collapses into a single box.
7. Now click the top male hand span, scroll down, and then shift-click the last male hand span. All of the male hand spans should now be selected. In the box still on the screen you should see C2:C88.
8. Click the small red arrow on the right end of the box, and the Quartile window should re-appear.
9. In the box labeled Quart type 1 to indicate that you want the first quartile, and then click OK.

The cell you first selected should now contain 21.75, which is the first quartile of the male hand spans. I’ll bet you are ready to compute the rest of the numbers for the male hand span five-number summary and then compute the summary for the female hand spans. Here are a few hints. To find the third quartile, enter 3 into the Quart box; to find the largest number in a data set, use the MAX function; and to find the smallest number, use the MIN function. When you have finished, here is what you should have.

Histograms, Stem-and-leaf plots, and Dotplots

Unfortunately, Excel’s histogram is not one of its best features. The procedure for creating a histogram in Excel is cumbersome, and the result does not look quite the way a histogram is supposed to look. Nevertheless, let’s create the histogram first; we can identify its strengths and weaknesses later. I want you to use Excel to create a histogram, like the one in Figure 2.7 in MOS, of women’s right hand spans. The process is made easier by the fact that you have already computed the five-number summary for these data. For a histogram we need to determine a set of categories into which the data will be grouped. We need to tell Excel what the boundaries of those categories are, what Excel refers to as the “bins.” We know – because we have the five-number summary – that these hand spans range from a low of 12.5 centimeters to a high of 23.25 centimeters.
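If you want to double-check a five-number summary outside Excel, the unstacking and summary steps can be sketched in Python. Treat this as a minimal sketch under stated assumptions: the records are made up for illustration (they are not the pennstate1 data), the helper names unstack and five_number_summary are hypothetical, and I am assuming that the "inclusive" interpolation method in Python's statistics module agrees with Excel's QUARTILE function.

```python
from statistics import median, quantiles

# Hypothetical stacked records (sex, right hand span in cm);
# these are NOT the pennstate1 data.
stacked = [
    ("Female", 18.0), ("Male", 22.5), ("Female", 19.5), ("Male", 21.0),
    ("Female", 20.0), ("Male", 23.0), ("Female", 21.0), ("Male", 24.0),
    ("Female", 17.5), ("Male", 20.0),
]


def unstack(records):
    """Split stacked (group, value) records into one sorted list per group."""
    columns = {}
    for group, value in records:
        columns.setdefault(group, []).append(value)
    return {group: sorted(values) for group, values in columns.items()}


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum).

    Quartiles use the "inclusive" method, assumed here to match
    Excel's QUARTILE function.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median(values), q3, max(values))


spans = unstack(stacked)
male_summary = five_number_summary(spans["Male"])
female_summary = five_number_summary(spans["Female"])
print("Male:  ", male_summary)
print("Female:", female_summary)
```

The lowest and highest values in each summary give the range that the histogram categories need to cover.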
There are many ways we could define categories to cover this range, but let’s copy the categories used in Figure 2.5. The categories are 10 to 11, 11 to 12, 12 to 13, and so on up to 23 to 24. The bins for Excel corresponding to these categories are 11, 12, 13, and so on up to 24.

1. In the worksheet where you sorted the hand span data, select a column to the right of the data, type the word bins in the first cell and the numbers 11 to 24 in the cells under that, one number per cell.
2. Next go to the Data menu and select Data Analysis. If Data Analysis is in the list of tools, skip to step 4. If it is not in the list, you will need to add it.
   a. To do this, go to the File menu and select Options, then click Add-ins.
   b. In the Add-ins window that appears you should see Analysis ToolPak. Click the box to the left of that, and then click OK. It may take Excel a few seconds to load the tool pack.
3. Now go back to the Data menu, and you should find Data Analysis listed.
4. When the Data Analysis window opens, scroll down to Histogram and then click OK.
5. In the Input Range box, click the red arrow and select the range of female hand span data, including the variable name in the first row.
6. In the Bin Range box repeat the procedure to select the list of bins, again including the name bins in the first row.
7. Now click the Labels box, then click Chart Output, and finally click OK.
8. The histogram may be quite squashed down. If so, click on it anywhere and then move the cursor over the small black square in the center of the bottom of the histogram box. The cursor should change to a double arrow. Click and hold the mouse button down while you drag down until the histogram is large enough to be easy to understand. It should look like the one below.

Notice that the shape formed by the vertical bars in the histogram is very similar to that in Figure 2.7 in the text. However, there is a difference between the two histograms.
The main feature that distinguishes a histogram from a bar chart is that, in a histogram, there are no gaps between the vertical bars. Note that this is the case with Figure 2.7. It is not true, however, of the histogram you have just created using Excel. The gaps between bars in a bar chart are there to emphasize that the bars represent distinct categories. By the same logic, a histogram should not have gaps, because its bars represent categories that make up one continuous and uninterrupted range of numbers. It is a flaw in Excel's histogram that it is depicted as a bar chart. However, here is a histogram of the same data created with an Excel add-in.

[Figure: Histogram of women's right handspans. Vertical axis: Frequency. Horizontal axis: Right handspan (cm), in bins from <=11 to >23.]

Standard Excel does not include stem-and-leaf plots, dot-plots or box-plots among its data analysis tools. Thus we cannot use Excel to create plots similar to Figures 2.8 and 2.9 in MOS. As mentioned in chapter 1 of this manual, there are add-ins that extend the statistical capabilities of Excel. Several of these add-ins include menu options for creating one or more of these plots.

Summary Measures

Section 2.5 in MOS presents several summary measures for quantitative variables. Let's see how to use Excel to compute those measures. Specifically, we will use Excel to compute the mean, range, and interquartile range, in addition to the five-number summary encountered earlier. We will use the Songs on Student iPods data shown in Example 2.11 in MOS. Before you can compute the summaries, you will need to enter the data into Excel. Go ahead; I'll wait.

1. In cells E5 to E12 type the following eight labels in a column: mean, minimum, 1st quartile, median, 3rd quartile, maximum, range, and interquartile range.
2. To compute the mean, use Excel's Average function. In the cell to the right of the word mean, type =AVERAGE(A2:A25).
3. You already know how to find each of the numbers in the five-number summary. So go ahead and do that.

The range is simply the largest value minus the smallest value. Thus in the cell to the right of the word range, you can type =F10-F6 since F10 should be the cell where you determined the maximum (using the MAX function) and F6 should be the cell where you determined the minimum (using the MIN function). Finally, the interquartile range is the difference between the third quartile and the first quartile. So, you can compute this using the quartiles you have already determined in the same way you just computed the range. When you are finished, your results should look like this.

If you want to check your formulas in cells F5 through F12, here is what they should be.

Finally, you will compute a variance and a standard deviation. First, let's use the Songs data set listed in Example 2.11 of MOS.

1. Type these numbers in cells A2 through A25 in an empty Excel worksheet. Remember that, if you don't have any empty worksheets, you now know how to add one (reminder: look in the Insert menu).
2. Type a label in cell A1 (I used Songs) just to get in the habit of always using a label for a column of data.
3. Next in a blank cell type the word Variance. In the cell to the right of this type =VAR(A2:A25).
4. To obtain the standard deviation, use the Excel function STDEV in the same way you just used VAR. That is, type =STDEV(A2:A25).

The results are:

Let's review what we have accomplished in this chapter. You can create one-way and two-way frequency tables (using Pivot Table on the Insert menu). You now know how to use Excel to create a pie chart or a bar chart (using the Pie and Column options on the Insert menu) and a histogram (using the Histogram command under Data Analysis in the Data menu).
You also know how to use Excel to compute a five-number summary as well as a mean, range, interquartile range, variance, and standard deviation (all using Excel’s built-in functions). Along the way you learned how to unstack and sort data (using the Sort command under Data). When you have a set of data to analyze, it is a good idea to start by creating graphs and computing summary measures to “get a feel for” the data. It may not seem like we have covered a lot when it’s listed in one short paragraph, but you now have a useful array of tools for summarizing a set of data.
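As a cross-check on the summary measures reviewed above, the same quantities can be sketched in Python. This is a minimal sketch under stated assumptions: the values are made up for illustration (they are not the Songs data from Example 2.11), and I am assuming that statistics.variance and statistics.stdev, which divide by n - 1, play the same role as Excel's VAR and STDEV.

```python
from statistics import mean, quantiles, stdev, variance

# Hypothetical data for illustration only; NOT the Songs data set.
songs = [2, 4, 4, 4, 5, 5, 7, 9]

songs_mean = mean(songs)                  # like =AVERAGE(...)
songs_range = max(songs) - min(songs)     # like =MAX(...) - MIN(...)

# Quartiles by the "inclusive" method, assumed to match Excel's QUARTILE.
q1, _, q3 = quantiles(songs, n=4, method="inclusive")
songs_iqr = q3 - q1                       # third quartile minus first quartile

songs_var = variance(songs)               # sample variance, like =VAR(...)
songs_sd = stdev(songs)                   # sample standard deviation, like =STDEV(...)

print("mean:", songs_mean)
print("range:", songs_range)
print("interquartile range:", songs_iqr)
print("variance:", round(songs_var, 4))
print("standard deviation:", round(songs_sd, 4))
```

Note that the standard deviation is simply the square root of the variance, so either one can be used to check the other.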