Excel Ch1-4
Chapter 3: Relationships Between Quantitative Variables
Scatter plot
Summaries of single variables such as pie charts, histograms, means, and five-number summaries are
useful. But the real power of statistics comes from its methods for analyzing relationships between
variables. As we did with single variables in chapter 2, we start with graphical summaries and then look at
numerical summaries.
Look at Figure 3.1 in MOS. It is a scatter plot of hand span and height. Note which variable is plotted on
the horizontal axis and which is on the vertical axis. When Excel is used to plot these data, here is what the
result looks like:
[Figure: scatter plot titled "HandSpans and height", with Height (in.) from 50 to 80 on the horizontal axis and Handspan (cm.) from 15.0 to 26.0 on the vertical axis.]
Now let's see how to create this scatter plot using Excel.
1. Start by opening the workbook handheight. You should find three variables, sex, height, and
handspan, with 167 values in each column not counting the variable names in the first row.
2. Select all values in the height and handspan columns. That is, cells B1 through C168.
3. Click Insert, click on the Scatter option and select the first option.
The points on the scatter plot should be clustered tightly in the upper right corner of the graph. Let's
change that. Notice that the points range from approximately 55 to 80 on the horizontal axis and
approximately 16 to 26 on the vertical axis. We will make use of that information to adjust the scales on
the two axes.
1. Double-click on one of the numbers that label the horizontal axis. This should open the Format axis
window.
2. Click the Axis Options tab if it is not already in front.
3. To the right of the word Minimum, select Fixed and replace the 0 with 55. To the right of Maximum,
select Fixed and replace 100 with 80. To the right of Major unit, select Fixed and replace 20 with 5.
Then click Close.
4. Double-click on one of the numbers that label the vertical axis. Again, click the Axis Options tab if it
is not already in front.
5. Next to Minimum, click Fixed and replace 0 with 14, next to Maximum click Fixed and replace 30
with 26, and next to Major unit click Fixed and replace 5 with 2. Then click Close.
Your scatter plot should now look like the one shown above. You will probably note a few differences in
format such as the color of the background and the absence of grid lines on the scatter plot above. I
encourage you to experiment with right-clicking at various places on your graph and exploring the menu
options that appear.
Standard Excel does not offer the option to indicate two groups on a scatter plot using a third variable as in
Figure 3.4. However, there are statistical add-ins for Excel that do offer this option.
Trend Line, Regression, and Residuals
Now follow the same procedure to create a scatter plot of driver age and maximum legibility distance of
highway signs as shown in Figure 3.7 in MOS (refer back to Example 3.2). The data are in the workbook
signdist. As before, adjust the limits and major units of the scales so that they are roughly the same as
shown in Figure 3.7. I used 15 to 85, with a major unit of 10, on the horizontal axis and 250 to 600, with a
major unit of 50, on the vertical axis. Now it's time to add a trend line to the scatter plot.
1. Move the cursor over one of the points on your scatter plot and right-click.
2. When the menu pops up, select Add Trendline.
3. When the Add Trendline window opens, select the Linear Trend/Regression type.
4. Click the box to the left of Display equation on chart.
5. Click Close.
You should now see a straight line superimposed on your scatter plot similar to the one below. You should
also see the equation y = -3.0068x + 576.68 which is the regression equation corresponding to the trend
line.
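If you want to see what the trend-line equation does with actual numbers, here is a minimal Python sketch. Only the slope and intercept come from the Excel output; the function name and the three ages are assumptions chosen for illustration.

```python
# Slope and intercept as displayed by Excel's trend line on the chart.
SLOPE = -3.0068
INTERCEPT = 576.68


def predict_distance(age):
    """Predicted maximum legibility distance for a driver of the given age."""
    return SLOPE * age + INTERCEPT


# Ages chosen only for illustration.
for age in (20, 50, 80):
    print(age, round(predict_distance(age), 2))
```

Because the slope is negative, the predicted distance drops by about 3 units for each additional year of age.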
The regression equation can be used to predict maximum sign legibility distance based on a driver's age.
However, sometimes we want to go beyond the regression equation and obtain additional information about
the relationship between the two variables. For example, as described in Section 3.2, we may want to
analyze the residuals. In order to obtain residuals we use Excel's Regression command, one of the options
in Data Analysis (in the Data menu).
1. Click on the tab for the worksheet containing the age and distance data in signdist.
2. Select Data > Data Analysis > Regression. In the Regression window several boxes must be filled in.
a. In the box to the right of Input Y Range type the range of cells containing the distance data (for
example, B1:B31).
b. In the box to the right of Input X Range type the cells containing the age data.
c. Click the box next to Labels.
d. In the box to the right of New Worksheet Ply type a name for the new worksheet Excel will create.
(I wasn't very creative; I just named it regression.)
e. Click the box next to Residuals and the box next to Residual Plots.
f. Finally, click OK, and you should get a new worksheet showing regression results.
On the new worksheet, scroll down until you see RESIDUAL OUTPUT. Below that heading you will see
the predicted distance and corresponding residual for each of the 30 points on the scatter plot. A partial
listing is included below. On the upper right part of the worksheet you should see a scatter plot, like the
one shown below, with age on the horizontal axis and residuals on the vertical axis. You can ignore the
rest of the regression output for now.
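To make the idea of a residual concrete, the sketch below fits a least-squares line and then subtracts each predicted value from the observed value. It is a sketch under stated assumptions, not a copy of Excel's routine: the ages and distances are made up (they are not the signdist data), and the function names are my own.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y):
    """Observed y minus predicted y for each point."""
    slope, intercept = fit_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Made-up ages and distances, NOT the signdist data.
ages = [20, 30, 40, 50, 60, 70]
distances = [520, 480, 470, 420, 400, 360]

for age, res in zip(ages, residuals(ages, distances)):
    print(age, round(res, 2))
```

The residuals from a least-squares line always add up to zero (apart from rounding), which is why a residual plot is scattered around the horizontal axis.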
Correlation
It is quite simple to determine a correlation coefficient with Excel. Let's do that for the age and distance
data from Example 3.2 that you have just been working with.
1. Click on the tab of the worksheet containing the age and distance data, and click on an empty cell
where you want Excel to put the correlation coefficient. (I clicked cell E3.)
2. Click the Paste Function icon (fx).
3. Under Function category select Statistical, and then, under Function name, select CORREL.
4. Click OK.
5. In the Array 1 box type the range of data for either variable, age or distance. Then in the Array 2 box
type the range of data for the other variable. Notice that, for correlation, it does not matter which
variable you specify as array 1.
6. Click OK.
You should now see the correlation between driver age and maximum legibility distance of highway signs
as shown below. I entered the word correlation in the cell above the correlation coefficient since Excel
does not automatically add a label.
Correlation
-0.801244651
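The claim that the order of the two arrays does not matter can be checked with a few lines of Python. This is only a sketch: the function name correl is borrowed from the Excel function for readability, the code is a plain Pearson correlation, and the data are made up rather than taken from signdist.

```python
import math


def correl(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


# Made-up values, NOT the signdist data.
ages = [20, 30, 40, 50, 60, 70]
distances = [520, 480, 470, 420, 400, 360]

r_xy = correl(ages, distances)
r_yx = correl(distances, ages)
print(r_xy, r_yx)  # the two values agree
```

Swapping the arguments only swaps the roles of the two variables inside symmetric sums, so the result is unchanged.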
Regression Output
Finally, we will look at some of the elements of the regression output generated by Excel. Open the
workbook pennstate1. Letting RtSpan be the explanatory variable and LftSpan be the response variable,
use Excel's regression command (Data > Data Analysis) to perform a regression analysis. For each of the
values below highlighted in yellow, confirm that you got the same result and compare it with the value
shown on page 88 of the text– continued in MOS.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.950
R Square             0.902
Adjusted R Square    0.902
Standard Error       0.639
Observations         190

ANOVA
              df     SS        MS        F          Significance F
Regression    1      708.15    708.15    1736.38    0.000
Residual      188    76.67     0.41
Total         189    784.82

              Coefficients    Standard Error    t Stat    P-value
Intercept     1.4635          0.4792            3.05      0.003
RtSpan        0.9383          0.0225            41.67     0.000
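Several entries in this output are simple functions of the others, which gives a handy way to check that you are reading the table correctly. The sketch below recomputes them in Python from the rounded values printed above; the variable names are my own, and small differences from Excel's figures are expected because Excel works from unrounded numbers.

```python
import math

# Values copied from the Excel regression output above.
ss_regression = 708.15
ss_residual = 76.67
ss_total = 784.82
df_regression = 1
df_residual = 188

r_square = ss_regression / ss_total          # share of variation explained
multiple_r = math.sqrt(r_square)
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)

# t Stat is the coefficient divided by its standard error.
t_intercept = 1.4635 / 0.4792
t_rtspan = 0.9383 / 0.0225

print(round(r_square, 3), round(multiple_r, 3), round(standard_error, 3))
print(round(f_stat, 1), round(t_intercept, 2), round(t_rtspan, 2))
```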
Chapter 4: Relationships between Categorical Variables
In chapter 3 we saw how Excel can help us analyze relationships between quantitative variables. In this
chapter we address the same questions: (1) are the variables related and, if so, (2) what is the nature of the
relationship? But this time we will consider categorical variables. We will use the pennstate1.xls data
file. I recommend you also review the section headed Summarizing Categorical Variables with the Pivot
Table in chapter 2 of this manual. Please use chapter 2 of this manual to create a pivot table using Form as
the row variable and SQPick as the column variable. While you are doing that, I'm going to get another
cup of coffee.
In case you are having trouble finding the pivot table, here is what it looks like.
You can easily convert these observed counts into row percents.
1. Move the cursor anywhere over the pivot table and right-click.
2. From the menu that drops down, select Show Values as.
3. Select % of row total.
The new table should look like this:
You can obtain a table of column percentages by repeating the procedure above. However, this time, under
Show Values as, select % of column total. The new table looks like this.
Finally, to get back to the original table of counts, repeat the above procedure and, under Show Values as,
select No Calculation.
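If it helps to see what the row and column percentages mean arithmetically, here is a minimal Python sketch. The counts are hypothetical (they are not the pennstate1 counts), and the function names are my own.

```python
# Hypothetical observed counts for a 2 x 2 table (rows x columns).
# These are NOT the pennstate1 counts.
observed = [[30, 20],
            [10, 40]]


def row_percents(table):
    """Each cell as a percentage of its row total."""
    return [[100 * cell / sum(row) for cell in row] for row in table]


def column_percents(table):
    """Each cell as a percentage of its column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[100 * cell / col_totals[j] for j, cell in enumerate(row)]
            for row in table]


print(row_percents(observed))
print(column_percents(observed))
```

Row percentages add to 100 across each row, and column percentages add to 100 down each column.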
Expected Counts
If you examine Figure 6.4 in MOS, you will notice that it includes expected counts. Recall that these are
the counts we would expect if there were no relationship between the two variables: order of letters on the
questionnaire and letter picked by the student. Let's see how we can use Excel to compute these expected
counts.
1. If your table contains percentages, restore them to cell counts. Repeat the three-step process above and,
in the final step, under Show Values as, select No Calculation.
2. We want to leave the original table where it is, but create a copy and place it to the right of the original.
Starting in the lower right cell of the table, click and drag to the upper left cell. The entire table should
now be highlighted. Use Copy/Paste Special, pasting values only to place a copy of the table with its
upper left cell in cell F2. (If your original table does not have its upper left cell in cell A2, use
Copy/Paste to move it there.)
3. Note that the following cell references will not work for you unless you have the upper left corners of
your tables in cells A2 and F2, respectively.
4. In cell G5 type =$D5*B$7/$D$7. Press Enter.
5. Select cell G5, select Copy, select cells G5 through H6 (the four cells that contain the observed
counts), and select Paste.
Voila! Your copy of the original table should now contain the expected counts. What about that formula
that you entered into cell G5? Where did that come from? Recall that a $ to the left of a row or column
reference makes that reference "fixed" instead of "relative." For example, the reference $D5 has a fixed
column reference but a relative row reference. When the formula is copied into cell G6, the row reference
will change to a 6, but the column reference will remain constant as D. I recommend you study the formula
until you are sure you understand what happens when you copy it to other cells. The ability to combine
fixed and relative references is one of the features that make Excel so versatile. I also labeled the two
tables, in cells A1 and F1, for use in the next section.
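The formula =$D5*B$7/$D$7 is simply row total times column total divided by grand total. The Python sketch below does the same calculation for every cell at once; the counts are hypothetical (not the pennstate1 counts) and the function name is my own.

```python
def expected_counts(observed):
    """Expected count = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]        # like column D
    col_totals = [sum(col) for col in zip(*observed)]  # like row 7
    grand_total = sum(row_totals)                      # like cell D7
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical observed counts, NOT the pennstate1 counts.
observed = [[30, 20],
            [10, 40]]

print(expected_counts(observed))
```

In the list comprehension, r plays the part of $D5 (it changes only with the row), c plays the part of B$7 (it changes only with the column), and grand_total plays the part of $D$7 (it never changes).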
The Chi-square Test
We now have the two main ingredients for a chi-square test to determine whether the two variables are
related in the population from which the sample was drawn. Those ingredients are a table of observed
counts and a table of expected counts.
We will use the Excel function Chitest to find the p-value for the test.
1. Click on the empty cell where you want the result of the chi-square test to be placed.
2. Click the Paste Function icon. Under Function category select Statistical, and then under Function
name, select CHITEST.
3. Click OK.
4. In the Actual range box specify the range of cells that contain the observed counts, not including the
totals. That is, type B5:C6.
5. In the Expected range box type the range of cells containing the expected counts, G5:H6.
6. Click OK.
You should now see the p-value:
p-value = 0.004689
Notice that I have added a title so that, if I look at my Excel worksheet in the future, I will remember what
the number represents.
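For readers who want to see what lies behind the p-value, here is a minimal Python sketch of a chi-square test for a 2 x 2 table. It rests on assumptions that should be stated plainly: the counts are hypothetical (not the pennstate1 counts), the function names are my own, and the p-value formula shown is valid only for 1 degree of freedom, where the chi-square statistic is the square of a standard normal variable.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


def p_value_df1(chi_square):
    """Right-tail probability for a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical counts, NOT the pennstate1 counts.
observed = [[30, 20], [10, 40]]
expected = [[20.0, 30.0], [20.0, 30.0]]

stat = chi_square_statistic(observed, expected)
print(stat, p_value_df1(stat))
```

As a check against the worksheet, p_value_df1(7.995561) comes out at about 0.004689.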
Finding Chi-Square
Excel's Chitest function does not provide the chi-square value, only the p-value. However, if you want to
know the value of chi-square, it is easy to find.
1. Click on the empty cell where you want the chi-square value placed.
2. Click the Paste Function icon. Under Function category select Statistical, then under Function name,
select CHIINV, which stands for chi-square inverse. That is, CHIINV takes a p-value and "works back"
to the chi-square value based on the chi-square probability distribution.
3. Click OK.
4. In the Probability box enter the reference to the cell that contains the p-value (0.004689).
Alternatively, you can enter the p-value directly instead of the cell reference. CHIINV expects the
right-tail probability, which is the p-value itself; only the left-tailed function CHISQ.INV needs 1-
followed by the p-value.
5. In the Deg freedom box enter the number 1.
6. Click OK.
You should now see the chi-square value below, to which I added a label.
Chi-square = 7.995561
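"Working back" from a p-value to a chi-square value can be sketched as a simple search. The Python code below assumes 1 degree of freedom and uses bisection, which is my choice for illustration and not a description of how Excel does it; the function names are hypothetical.

```python
import math


def p_value_df1(chi_square):
    """Right-tail probability for a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


def chi_square_from_p_df1(p, low=0.0, high=100.0):
    """Find the chi-square value whose right-tail probability is p (df = 1).

    Uses bisection: the right-tail probability falls as chi-square rises.
    """
    for _ in range(200):
        mid = (low + high) / 2
        if p_value_df1(mid) > p:
            low = mid   # tail still too large, so chi-square must be bigger
        else:
            high = mid
    return (low + high) / 2


print(chi_square_from_p_df1(0.004689))
```

The result agrees with the worksheet value to about three decimal places; the p-value typed in here is itself rounded, so the last digits differ slightly.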
Finding The p-value
Finally, if you already have a chi-square value and want to find the corresponding p-value, here's how to do
it with Excel. Let's use a chi-square value of 7.995.
1. Click on the empty cell where you want the p-value to be placed.
2. Click the Paste Function icon. Under Function category select Statistical, and then under Function
name, select CHIDIST, which stands for chi-square distribution. That is, CHIDIST takes a chi-square
value and finds the corresponding p-value based on the chi-square probability distribution.
3. In the X box type either 7.995 or a reference to a cell where you have already entered the chi-square
value.
4. In the Deg freedom box type the number 1.
5. Click OK.
You should now see the p-value, approximately 0.00469, because CHIDIST returns the right-tail
probability. If you use the left-tailed function CHISQ.DIST with Cumulative set to TRUE instead, you will
see approximately .99531; in that case, select a cell and type =1- followed by that value to find the p-value.
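The relationship between the two tails of the distribution is easy to confuse, so here is a short Python sketch for a chi-square value of 7.995 with 1 degree of freedom. The formulas rely on the fact that, for 1 degree of freedom only, chi-square is the square of a standard normal variable; the variable names are my own.

```python
import math

chi_square = 7.995

# Right-tail probability (the p-value) for 1 degree of freedom.
right_tail = math.erfc(math.sqrt(chi_square / 2))

# Left-tail (cumulative) probability for the same value.
left_tail = math.erf(math.sqrt(chi_square / 2))

print(round(right_tail, 5), round(left_tail, 5))
```

The two probabilities always add to 1, so either one can be recovered from the other by subtraction.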
Excel® Technology Manual
to Accompany
Mind on Statistics
© Cengage Learning. All rights reserved. No distribution allowed without express authorization.
FIFTH EDITION
Jessica M. Utts
University of California, Irvine
Irvine, CA
Robert F. Heckard
Pennsylvania State University
State College, PA
Prepared by
Melissa M. Sovak
California University of Pennsylvania, California, PA
Australia • Brazil • Mexico • Singapore • United Kingdom • United States
ISBN-13: 978-1-285-83862-5
ISBN-10: 1-285-83862-9
© 2015 Cengage Learning
ALL RIGHTS RESERVED. No part of this work covered by the
copyright herein may be reproduced, transmitted, stored, or
used in any form or by any means graphic, electronic, or
mechanical, including but not limited to photocopying,
recording, scanning, digitizing, taping, Web distribution,
information networks, or information storage and retrieval
systems, except as permitted under Section 107 or 108 of the
1976 United States Copyright Act, without the prior written
permission of the publisher except as may be permitted by the
license terms below.
Cengage Learning
200 First Stamford Place, 4th Floor
Stamford, CT 06902
USA
Cengage Learning is a leading provider of customized
learning solutions with office locations around the globe,
including Singapore, the United Kingdom, Australia,
Mexico, Brazil, and Japan. Locate your local office at:
www.cengage.com/global.
Cengage Learning products are represented in
Canada by Nelson Education, Ltd.
For product information and technology assistance, contact us at
Cengage Learning Customer & Sales Support,
1-800-354-9706.
To learn more about Cengage Learning Solutions,
visit www.cengage.com.
For permission to use material from this text or product, submit
all requests online at www.cengage.com/permissions
Further permissions questions can be emailed to
permissionrequest@cengage.com.
Purchase any of our products at your local college
store or at our preferred online store
www.cengagebrain.com.
NOTE: UNDER NO CIRCUMSTANCES MAY THIS MATERIAL OR ANY PORTION THEREOF BE SOLD, LICENSED, AUCTIONED,
OR OTHERWISE REDISTRIBUTED EXCEPT AS MAY BE PERMITTED BY THE LICENSE TERMS HEREIN.
READ IMPORTANT LICENSE INFORMATION
Dear Professor or Other Supplement Recipient:
Cengage Learning has provided you with this product (the
“Supplement”) for your review and, to the extent that you adopt
the associated textbook for use in connection with your course
(the “Course”), you and your students who purchase the
textbook may use the Supplement as described below.
Cengage Learning has established these use limitations in
response to concerns raised by authors, professors, and other
users regarding the pedagogical problems stemming from
unlimited distribution of Supplements.
Cengage Learning hereby grants you a nontransferable license
to use the Supplement in connection with the Course, subject to
the following conditions. The Supplement is for your personal,
noncommercial use only and may not be reproduced, or
distributed, except that portions of the Supplement may be
provided to your students in connection with your instruction of
the Course, so long as such students are advised that they may
not copy or distribute any portion of the Supplement to any third
party. Test banks, and other testing materials may be made
available in the classroom and collected at the end of each class
session, or posted electronically as described herein. Any
material posted electronically must be through a password-protected site, with all copy and download functionality disabled,
and accessible solely by your students who have purchased the
associated textbook for the Course. You may not sell, license,
auction, or otherwise redistribute the Supplement in any form. We
ask that you take reasonable steps to protect the Supplement from
unauthorized use, reproduction, or distribution. Your use of the
Supplement indicates your acceptance of the conditions set forth in
this Agreement. If you do not accept these conditions, you must
return the Supplement unused within 30 days of receipt.
All rights (including without limitation, copyrights, patents, and trade
secrets) in the Supplement are and will remain the sole and
exclusive property of Cengage Learning and/or its licensors. The
Supplement is furnished by Cengage Learning on an “as is” basis
without any warranties, express or implied. This Agreement will be
governed by and construed pursuant to the laws of the State of
New York, without regard to such State’s conflict of law rules.
Thank you for your assistance in helping to safeguard the integrity
of the content contained in this Supplement. We trust you find the
Supplement a useful teaching tool.
Excel® is a trademark of the Microsoft group of companies.
Excel Technology Manual for Mind on Statistics 5e is an independent publication and is not affiliated with, nor has it
been authorized, sponsored, or otherwise approved by Microsoft Corporation.
Printed in the United States of America
1 2 3 4 5 6 7 17 16 15 14 13
Contents
Chapter 1: Introduction
Chapter 2: Turning Data into Information
Chapter 3: Relationships between Quantitative Variables
Chapter 4: Relationships between Categorical Variables
Chapter 5: Sampling: Surveys and How to Ask Questions
Chapter 6: Gathering Useful Data for Examining Relationships
Chapter 7: Probability
Chapter 8: Random Variables
Chapter 9: Understanding Sampling Distributions: Statistics as Random Variables
Chapter 10: Estimating Proportions with Confidence
Chapter 11: Estimating Means with Confidence
Chapter 12: Testing Hypotheses about Proportions
Chapter 13: Testing Hypotheses about Means
Chapter 14: More about Regression
Chapter 15: More about Categorical Variables
Chapter 16: Analysis of Variance
Chapter 1: Introduction
Organization of this manual
This manual's goal is to help you learn to perform the computational parts of statistical analysis using
Microsoft Excel. Each chapter is a companion to the corresponding chapter in your Mind On Statistics
textbook. I have used the same chapter titles to avoid confusion. The manual uses examples from the text
so that, after you have analyzed the data using Excel, you can always check your results against those in the
text. When a method discussed in the text is not included in this manual, that method is not a feature of
standard Excel and cannot be implemented using the formulas presented in the text.
Excel
This manual is not a comprehensive guide to Excel. It focuses specifically on statistical analysis.
Furthermore, it does not explain how to use a personal computer or how to work with the Microsoft
Windows operating system, as this manual was written assuming that the user has had experience with this
operating system.
The first step, of course, is to make sure that the computer you plan to use has Microsoft Excel installed.
Excel is part of the Microsoft Office suite of programs. If you are not sure whether Excel is on your
computer, the fastest way to find out is to click the Start button, scroll up to Programs, and look for
Microsoft Excel in the list of programs that appear on the screen. If you find it, click on the title, and Excel
will open. As it does, you will see a small window appear on the screen temporarily that indicates what
version of Excel is on the computer. In this manual I have written all instructions based on Excel 2003.
However, if you have an earlier version of Excel, you will find that most of the instructions I give will
work for you as they are presented here. For Excel 2010 the first screen should look like this
Before we explore Excel, I want to call your attention to a convention I just used because I will use it
throughout the manual.
1. I will use red type for references to the textbook such as case study 1.1.
2. I will use green type to refer to variables such as HrsSleep and data files such as pennstate1.
3. I will use blue type when an action is called for, such as click or scroll, or when I am referring to
an Excel menu item such as File or an Excel element such as the function Average.
4. I will use bold type for a reference to a cell such as A3, text you are to type in a cell such as
=4*A3, and the contents of a cell after you have carried out an instruction.
If you are not familiar with all of the terms in these four statements, don't worry; we'll take care of that right
now.
You should now have Excel open on your computer screen. Across the top of the screen you will see a list
of menu names, File, Home, Insert, Page Layout, etc. Most of these you are already used to seeing if you
use Microsoft Word. Under these you should see several options associated with the Home menu such as
text options, alignment options, copy and paste options and more.
The rest of the screen is either dark gray or is white with a grid of vertical and horizontal lines as shown
above. If the screen is dark gray, use the mouse to move the cursor to the File menu and click on New and
a new workbook opens. Near the bottom of the screen you should see tabs labeled Sheet1, Sheet2, and
Sheet3. When you open a new workbook, it contains three worksheets. You can add or delete worksheets
as needed, but more about that later. You should also see that grid of lines I mentioned before. There
should be headings across the top of the grid: A, B, C, etc. These are the column headings. Down the left
side of the screen there should be numbers: 1, 2, 3, etc. These are the row headings. Where a column and a
row intersect is called a cell, and its row and column designators refer to the cell. Thus D3 is the cell
reference for the cell in the fourth column and the third row. Notice that cell D3 is outlined with a dark line
in the figure below.
The cell is the basic work unit within an Excel worksheet. Let’s have our first look at what you can do with
a few cells in a worksheet.
1. In cell A1 type Temperature Conversion. Notice that not all of the text fit in cell A1, but Excel
allowed it to spill over into cells B1 and C1.
2. In cells A3 and B3 type Temp F and Temp C, respectively.
3. In cell A4 type 68, and in cell B4 type =5/9*(A4-32). After you press the Enter key you should see the
number 20 in cell B4. You have just converted a temperature in Fahrenheit, 68 degrees, into a
temperature in Celsius, 20 degrees.
4. Now click on cell B4. Notice the dark box around the cell. This tells you that cell is currently active.
Just below the formatting toolbar, you should see the Formula Bar containing what you typed in cell
B4. If you needed to edit what you typed in the cell, you would edit the contents of the Formula Bar.
5. With cell B4 active, move the cursor to the home toolbar and click on the icon B. The 20 in cell B4
should now be in boldface type. Click on the B again, and the boldface goes back to regular type.
6. To the right of the icons B, I, and U are four icons for aligning text. Click the icon to Center the
contents of B4. Also click to Center the contents of cell A4.
7. Finally, change the title in cell A1 to boldface type.
This is what your Excel worksheet should now look like.
Let’s review what you just did. You typed three kinds of contents into worksheet cells: text in cells A1,
A3, and B3, a numerical value in cell A4, and a formula in cell B4.
I want to show you one more feature of Excel before we leave our temperature example.
8. Type the number 70 in cell A5. Center the number in this cell.
9. Click on cell B4. On the home menu, click the Copy icon to copy the formula.
10. Click on cell B5 to make it active, and then click on Paste. What do you see in cell B5? It should be
the number 21.11111.
11. Let’s round this off to 21.1. Click on cell B5 to make it active. Move the cursor to the Decrease
Decimal icon on the formatting menu. The icon shows .00 with an arrow pointing to .0.
12. Click on this icon. The number in cell B5 should now be 21.1111. Click the Decrease Decimal icon
three more times, and cell B5 should now contain 21.1. However, it is important to realize that only
the display of the number in the cell has been rounded. The number stored in the memory of the
computer has not been changed. Your worksheet should now look like this.
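The difference between the value a cell stores and the value it displays can be illustrated outside Excel as well. In this minimal Python sketch the function name is my own; the arithmetic is the same as the worksheet formula, and the one-decimal string stands in for the formatted cell.

```python
def fahrenheit_to_celsius(temp_f):
    """Same arithmetic as the worksheet formula =5/9*(A4-32)."""
    return 5 / 9 * (temp_f - 32)


stored = fahrenheit_to_celsius(70)   # full-precision value, like the cell's contents
displayed = f"{stored:.1f}"          # one decimal place, like the formatted cell

print(fahrenheit_to_celsius(68))
print(displayed, stored)
```

Formatting changes only the text that is shown; any later calculation that refers to the value still uses all of its digits.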
Let’s review. You now know how to:
1. Open a new workbook
2. Make a worksheet cell active
3. Enter content into a cell
4. Change the format of the content, and
5. Copy the content of a cell and paste it into another cell.
In later chapters you will discover how much time that copy/paste feature can save you. These are the
basics. In the chapters that follow we will explore many additional capabilities of Excel, but you will use
these basic features every time you work with Excel. Now let’s proceed to chapter 2 and do some statistics.
A Note about Add-Ins
The capabilities of basic Excel for statistical analysis are somewhat limited. If you are using Excel for this
purpose, you should be aware that there are several statistics add-ins available that extend Excel’s statistical
capabilities significantly. Some are sold commercially, and some are bundled with statistics textbooks.
Chapter 2: Turning Data into Information
In chapter 2 of the textbook, you learned a variety of ways, both graphical and numerical, to summarize a
set of numbers. In this chapter of the guide we will see how you can use Excel to help you with these tasks.
Types of Data
The first thing we need to discuss is types of data. This is very important since not all tools for
summarizing data are appropriate for all types of data. Unfortunately, the computer cannot, in most
situations, tell you which tools to use and which not to use. It will obediently use the wrong tool and give
you results that are nonsense. So, you need to first identify what kind of data you have. Here is a rule of
thumb that can help. If the data are represented by text, such as "left," "male," "yes," or "strongly agree,"
then they are almost certainly categorical data. If numbers, such as 3 or 6.514, represent the data, then they are
likely to be measurement data. Try this rule on the list of eight questions and the resulting data in Section
2.1 of the textbook. Don't be misled by the text that gives units of measure such as hours, inches, and mph.
Those terms are not part of the data. How many of the questions did you find that result in categorical
data? If you said questions 1 (sex: m or f) and 3 (letter: S or Q), you are correct. The other six all result in
measurement data.
You do need to watch out for categorical data that have been coded numerically. For example, if you have
data representing responses to question 1, the sex of the respondent, in Section 2.1 of the textbook, but the
data are all 0 and 1, don't be misled. Instead of "m" and "f" the data could also be "coded" using 0 for male
and 1 for female. Sex is still a categorical variable because the responses all fit in one of two categories
and do not represent quantitative information. That is, we would not count or measure to determine which
category a person belongs to. With this rule of thumb and caution in mind, let's get busy summarizing. I
will point out, as we go along, a few ways that Excel tries to help you match the tool to the data.
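The rule of thumb and its one important exception can be written down in a few lines of Python. This is only a sketch: the function name and all of the sample values are made up for illustration.

```python
def looks_categorical(values):
    """Rule of thumb: data stored as text are almost certainly categorical."""
    return all(isinstance(v, str) for v in values)


sex_as_text = ["m", "f", "f", "m"]
sex_coded = [0, 1, 1, 0]          # same variable, coded 0 = male, 1 = female
hours_of_sleep = [7, 6.5, 8, 5]   # made-up measurement values

print(looks_categorical(sex_as_text))     # True
print(looks_categorical(sex_coded))       # False, yet the variable is categorical
print(looks_categorical(hours_of_sleep))  # False
```

The second result is the caution in action: a computer sees only numbers, so it is up to you to know that the 0 and 1 values are category codes.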
Summarizing Categorical Variables with the Pivot Table
The first Excel tool that we will meet for summarizing data is also one of the most useful. It is called the
pivot table and is found in Excel's Insert menu. I will introduce you to the pivot table by showing you how
to use it to create a summary of a categorical variable similar to Tables 2.1 and 2.2 in the text. First, you
need to open Excel and then open the Excel data file YouthRisk03 (from the companion website) using the
procedures described in chapter 1. While you do that, I'm going to get a cup of coffee.
You should now see an Excel worksheet with data. You should see names in the top row: Sex, Grades, etc.
There are five columns and 3042 rows of data (not counting the row with the variable names). Your screen
should look like the picture below. Note that only the row of variable names and the first 16 rows of data
are shown. You need to scroll down to see the rest of the data.
Using the terms from Section 2.1 of the textbook, observe that there are five variables, one in each of the
first five columns of the worksheet, 3042 observational units, in rows 2 through 3043, and that one
observation, seatbelt use for observational unit 11 is the active cell, E12. The complete data set resides in
the block of cells A2 to E3043. It is important to remember that, since we always use row 1 for the
variable names, the row number of the last row of data will be the number of observational units plus one.
For this data set that is 3042 + 1 = 3043.
Here are the steps of the procedure for creating a pivot table:
1. Click on a cell anywhere in the data set.
2. Now move the cursor to the top of the screen and click on the word Insert. That will cause a new menu
ribbon to appear.
3. Click on Pivot Table. This opens the Create Pivot Table window.
4. Select the entire data set, including the labels in the first row are in the selected range of cells. If it is
not, you can change row and column references in the Table/Range box to include the entire data set.
5. Under Choose where you want the Pivot Table report to be placed, select New worksheet. Click OK.
You should now see the Pivot Table and Pivot Chart Wizard – Layout window shown below.
6. Find the Seatbelt button on the right and drag it to the Row section of the Layout template. Also drag
the same Seatbelt button to the Data section of the Layout template.
You should now have a pivot table that looks like this.
Notice that this table does not match the one in Section 2.3 of the textbook. The category names are in a
different order and they have numbers in front of them. However, we can edit the pivot table to look like
the one in the textbook. Move the cursor over cell A5 and click to make that cell active as shown above.
The contents of cell A5 should now be visible in the Formula Bar. Click on the Formula Bar and position
the cursor to the left of the word “Never.” Press the backspace key twice (to delete “1_”). Repeat this
process for cells A5 through A8.
Now the category names should be the same as those in the textbook table, but they are still in a different
order. Right-click, that is, click the right button on your mouse, on cell A9 and, as that cell becomes active,
a menu should appear. Move the cursor over the word Order and a sub-menu of options should appear.
Click on Move to Beginning. Repeat the use of the Order options until your pivot table looks like this:
If you want to convert the counts to percentages:
1. Move the cursor anywhere over the pivot table and right-click.
2. From the menu that drops down, click Value Field Settings.
3. Select the Show Values As tab.
4. Select % of column total from the drop down menu.
5. Click OK.
Your table should now look like this:
This table summarizes one variable, the frequency of use of seatbelts. So it is a "one-way" pivot table.
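If you would like to see the arithmetic behind a one-way table outside Excel, here is a minimal Python sketch. The function name one_way_table and the ten responses are made up for illustration; they are not the YouthRisk03 data, and this is not a description of how Excel builds the pivot table internally. It simply counts each category and expresses each count as a percent of the total.

```python
from collections import Counter

# Made-up sample of seatbelt responses; NOT the YouthRisk03 data.
seatbelt = ["Always", "Never", "Always", "Most times", "Rarely",
            "Always", "Sometimes", "Most times", "Always", "Never"]

def one_way_table(values):
    """Return {category: (count, percent of total)}, like a one-way table."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

table = one_way_table(seatbelt)
for category, (count, percent) in table.items():
    # One row per category: name, count, percent of all responses
    print(f"{category:<12}{count:>4}{percent:>8.1f}%")
```

The percent column plays the same role as the % of column total setting used above.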
Now let's create a "two-way" pivot table to summarize the frequency of seatbelt use for men and women
separately.
1. Again, click any cell in the data and then click on Insert. As before, click Pivot Table.
2. Check to see that all of the data cells, including variable names, are selected, and click OK.
3. This time you should drag the Sex button into the Row section of the template, the Seatbelt button into
the Column section, and the Seatbelt button into the Data section. You should now see a button in the
Data section that says Count of Seatbelt. (Actually, you could drag either the Seatbelt button or the
Sex button into the Data section and get the same result.)
Does your table look like this?
This table contains the same information as is shown in Table 2.2 in the text. Based on the numbers in the
table, would you say that the order in which the letters are presented influenced the choices made by the
students who answered? It might be easier to answer the question after looking at percents instead of
counts. Repeat the procedure used above to change Field Settings. This time under Show values as, select
% of row. Does your table look like this?
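For readers who want to check the idea behind "% of row" by hand, here is a small Python sketch of a two-way table. The function names and the eight (sex, seatbelt) pairs are invented for illustration and are not the YouthRisk03 data. Each count is divided by the total for its row, so the percentages in every row add up to 100.

```python
from collections import Counter

# Made-up (sex, seatbelt) pairs; NOT the YouthRisk03 data.
records = [("Female", "Always"), ("Female", "Always"), ("Female", "Never"),
           ("Female", "Rarely"), ("Male", "Always"), ("Male", "Never"),
           ("Male", "Never"), ("Male", "Rarely")]

def two_way_counts(pairs):
    """Count each (row category, column category) combination."""
    return Counter(pairs)

def row_percents(counts):
    """Convert counts to percentages of each row total ("% of row")."""
    row_totals = Counter()
    for (row, _col), n in counts.items():
        row_totals[row] += n
    return {(row, col): 100 * n / row_totals[row]
            for (row, col), n in counts.items()}

counts = two_way_counts(records)
percents = row_percents(counts)
for (row, col), pct in sorted(percents.items()):
    print(f"{row:<8}{col:<8}{counts[(row, col)]:>3}{pct:>7.1f}%")
```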
Visual Summaries for Categorical Variables: Pie Charts and Bar Graphs
Tables of numbers like the tables you just created often come in handy in interpreting data. However, a
well-chosen graph can not only be easier to understand but also be visually more powerful. Let's look at a
couple of graphs that Excel can create. Look at Figure 2.2 in MOS, your textbook. We are going to create
a pie chart very similar to that, but in order to do so we must first create another pivot table. Go back to the
data in pennstate1 and use the procedure described above for one-way pivot tables to create a pivot table
for the column headed RandNumb. When you drag the RandNumb button into the Data section of the
template you will probably see Sum of RandNumb. To change this from sum to count, double click on the
Sum of RandNumb button. The Pivot Table Field window should open. Under Summarize by, click
Count, click OK, and then proceed as before. Your table should look like the one below.
Next you will instruct Excel to use this table to create a pie chart.
1. Move the cursor over the table and click on any cell.
2. Now move the cursor to the top of the screen and click Insert.
3. Click Pie and select the first option.
Check your pie chart against the one shown below. Note that it is not exactly like Figure 2.2 in MOS, but
the difference is not significant. Repeat steps 2 and 3 above, but, this time, select Column under Chart type
instead of Pie and select the first option. Note that it is essentially the same as the bar chart in Figure 2.2 in
MOS.
Does your bar chart look like the one below?
Before we finish our exploration of how to summarize categorical data, we will create one more graph, a bar
chart for two categorical variables. You might want to reread Example 2.2 in MOS and look at Figure 2.4.
1. First, in a blank Excel worksheet, create a table like Table 2.3. Since you do not have the original data
from the survey of 479 children, you should type in category names and percentages to create a table
like the one in the text.
2. Do not include the percent sign (%) or counts shown in Table 2.3. Instead enter each percent as a
decimal fraction. For example, the cell in the row labeled Darkness and the column labeled No
Myopia contains 90%. Type this into your table as 0.9.
3. Now press Enter and then click on the cell into which you just entered the 0.9.
4. In the tool bars at the top of the screen, find the % icon. Hold the cursor over it for a few seconds
without clicking. The phrase Percent Styles should appear. Click on the icon. The contents of the cell
should have changed from 0.9 to 90%.
5. Type the rest of the percents into the table as decimal values (0.09, 0.01, 0.66, etc.), highlight all of the
numbers by clicking and dragging over them, and again click on the % icon. You do not need
to include the Total column or row, as you will not include them in your graph.
6. Now, highlight all of the percentages in your table along with the column labels (No Myopia, etc.), but
not the row labels. Then click Insert.
7. Select Column and select the first option under 3-D Column.
Your bar chart should look like the one below. Compare it with Figure 2.3 in MOS. Note that the colors in
the chart below are the same as the ones in the text. This is not important except to demonstrate that you
can change the formatting of a chart created by Excel.
Finding Information in Quantitative Data: The Five-Number Summary
We now move on to summarizing measurement or quantitative data. We will begin by creating the
five-number summaries for Example 2.5 in MOS. To create these summaries you will use several of Excel's
built-in statistics functions. But first a little data sorting is needed. It is sometimes the case that the way
the data are arranged is not the way you need to have them arranged for the analysis you want to perform.
So knowing how to rearrange the data can come in handy.
In the pennstate1 workbook, the right hand span data are in what is called a "stacked" format. That is
because right hand spans of women and those of men are stacked up in the same column. The only way we
can tell which is which is to have another column, in this case the variable called Sex, to tell us which
numbers are hand spans of women and which are those of men. What you need to do now is “unstack” the
RtSpan column so that you have hand spans for men and hand spans for women in two different columns.
1. Find a blank worksheet in the Excel workbook containing the hand span data. If there isn’t one,
click on the tab without a name to create a new worksheet.
2. Next click on the tab of the worksheet containing the data and click the letter at the top of the column
containing the variable Sex. This is probably column A. The entire column should now be
highlighted.
3. Click the Home menu and select Copy.
4. Now click the tab of the blank worksheet, click on cell A1, click on the Home menu, and select Paste.
You should now have the Sex variable in the leftmost column (column A) of the new worksheet.
5. Go back to the data worksheet, select the RtSpan column, and then copy/paste it into column B of the
new worksheet. I’ll wait.
6. Now we’re ready to sort data. In the new worksheet, click on any cell in the data, then click on the
Data menu and select Sort. The Sort window should now be open.
7. If the variable Sex is not selected in the Sort by box, click on the down arrow and select it.
8. Click Add Level.
9. In the Then by box, select RtSpan. Your Sort window should look like this.
10. Click OK.
The data are now sorted. Next you will rearrange the hand span data into two columns.
1. Scroll down to the last row containing the word Female in column A. This should be row 104.
2. Click on the cell containing the first male right hand span. Is this cell B105? It is in my worksheet.
3. Now scroll to the last row that contains data, row 191, hold down the shift key, and click on cell B191.
4. Go to the Home menu and click Cut.
5. Finally, scroll back to the top of the worksheet, click cell C2, and click Home and then Paste.
Whew! You’ve done it. The data are unstacked, and you are ready to compute five-number summaries.
This may seem like a lot of work just to prepare the data for analysis, but once you have done it a few
times, you will do it very quickly. There is one more small detail to see to. You need to define column
headings (variable names) that reflect the new arrangement of the data. For example, I typed F Span in cell
B1 and M Span in cell C1.
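The sort-then-split procedure above can also be sketched in a few lines of Python, which may help you see what "unstacking" does. The function name unstack and the six rows are made up for illustration; they are not the pennstate1 data. Sorting the (sex, span) pairs puts all the Female rows first, ordered by span, and then each value is appended to the list for its own group.

```python
# Made-up stacked data: one (sex, right hand span in cm) pair per row.
# These values are placeholders, NOT the pennstate1 data.
stacked = [("Male", 22.5), ("Female", 20.0), ("Female", 18.5),
           ("Male", 21.0), ("Female", 19.0), ("Male", 23.5)]

def unstack(rows):
    """Sort by group, then by value, and split the values into one list per group."""
    columns = {}
    for group, value in sorted(rows):
        columns.setdefault(group, []).append(value)
    return columns

columns = unstack(stacked)
f_span = columns["Female"]   # plays the role of column B (F Span)
m_span = columns["Male"]     # plays the role of column C (M Span)
print(f_span)
print(m_span)
```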
1. Click on a blank cell to the right of the data. I chose cell E7.
2. Type the word Median.
3. Now move two cells to the right and type =MEDIAN(. Click on the first male hand span (cell C2),
then hold down the shift key and click (that's called a shift-click) on the last male hand span (cell C88).
4. Type a right parenthesis, ), and press the enter key.
You should now see 22.5, the median of the male hand spans, in cell G7. You have just used one of
Excel’s many built-in functions. Go back and highlight the cell so that you can review what you typed.
Start with an equal sign, then type the name of the function, in this case median, and then parentheses
containing any “arguments” required by the function. For the median function, the argument required is
the range of cells that contains the data, written as its first and last cells separated by a colon (C2:C88).
By the way, I always type the function name in all capital
letters to remind me that I am using an Excel function, but that is not required.
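If you are curious what the median function is doing, here is a short Python sketch using the standard library's median function. The hand spans are made-up values, not the pennstate1 data. With an odd number of values the median is the middle value of the sorted data; with an even number it is the mean of the two middle values.

```python
from statistics import median

# Made-up hand spans in cm, NOT the pennstate1 data.
m_span = [21.0, 23.5, 22.5, 20.5, 24.0]

# Excel's =MEDIAN(C2:C88) takes one range; here the list plays that role.
odd_case = median(m_span)              # five values: the middle one after sorting
even_case = median(m_span + [25.0])    # six values: mean of the two middle ones
print(odd_case, even_case)
```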
There is an alternative way of entering a function into a cell.
1. Click the cell immediately under the cell in which you typed Median. Type the word Quartiles.
2. Now click the cell two cells to the right, G8.
3. Go to the top tool bar and click the Paste Function icon. It looks like this: fx.
4. The Paste Function window should now be open on your screen.
5. Scroll down until you find Quartile, click on it, and then click OK.
6. In the Quartile window, click on the small red arrow on the right side of the Array box. The Quartile
window collapses into a single box.
7. Now click the top male hand span, scroll down, and then shift/click the last male hand span. All of the
male hand spans should now be selected. In the box still on the screen you should see C2:C88.
8. Click the small red arrow on the right end of the box, and the Quartile window should re-appear.
9. In the box labeled Quart type 1 to indicate that you want the first quartile, and then click OK. The cell
you first selected should now contain 21.75, which is the first quartile of the male hand spans.
I’ll bet you are ready to compute the rest of the numbers for the male hand span five-number summary and
then compute the summary for the female hand spans. Here are a few hints. To find the third quartile, enter
3 into the Quart box; to find the largest number in a data set, use the MAX function; and to find the
smallest number, use the MIN function. When you have finished, here is what you should have.
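Here is a minimal Python sketch of a five-number summary, in case you want to check your Excel results another way. The function name five_number_summary and the sample values are made up for illustration. I am assuming that the "inclusive" method of Python's quantiles function interpolates between data values the same way Excel's QUARTILE function does; check that against your own worksheet before relying on it. Note that textbooks sometimes define quartiles slightly differently, so small discrepancies with hand calculations are possible.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    method="inclusive" is assumed here to match the interpolation used by
    Excel's QUARTILE function; other quartile definitions can differ a little.
    """
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

# Made-up values, NOT the hand span data.
sample = [1, 2, 3, 4, 5, 6, 7, 8]
summary = five_number_summary(sample)
print(summary)
```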
Histograms, Stem-and-leaf plots, and Dotplots
Unfortunately, Excel’s histogram is not one of its best features. The procedure for creating a histogram in
Excel is cumbersome, and the result does not look quite the way a histogram is supposed to look.
Nevertheless, let’s create the histogram first; we can identify its strengths and weaknesses later. I want you
to use Excel to create a histogram, like the one in Figure 2.7 in MOS, of women’s right hand spans. The
process is made easier by the fact that you have already computed the five-number summary for this data.
For a histogram we need to determine a set of categories into which the data will be grouped. We need to
tell Excel what the boundaries of those categories are, what Excel refers to as the “bins.” We know,
because we have the five-number summary, that these hand spans range from a low of 12.5 centimeters to a
high of 23.25 centimeters. There are many ways we could define categories to cover this range, but let’s copy
the categories used in Figure 2.5. The categories are 10 to 11, 11 to 12, 12 to 13, and so on up to 23 to 24.
The bins, corresponding to these categories, for Excel are 11, 12, 13 up to 24.
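To see how a list of bins turns data into counts, here is a small Python sketch. It assumes that each bin number is an upper limit, so a value is counted in the first bin whose number is greater than or equal to it, and that anything above the last bin goes into a final "More" row. That is my understanding of how the Histogram tool treats bins, so treat it as an assumption. The function name bin_counts and the hand spans are made up, not the pennstate1 data.

```python
from bisect import bisect_left

def bin_counts(values, bins):
    """Count values per bin, where each bin number is an upper limit.

    A value goes into the first bin whose number is >= the value; values
    above the last bin number are counted in the final "More" slot.
    """
    counts = [0] * (len(bins) + 1)   # last slot is the "More" row
    for v in values:
        counts[bisect_left(bins, v)] += 1
    return counts

bins = list(range(11, 25))           # 11, 12, ..., 24 as in the text
# Made-up hand spans in cm, NOT the pennstate1 data.
spans = [12.5, 18.0, 18.5, 19.0, 20.0, 20.5, 23.25]
counts = bin_counts(spans, bins)
for label, n in zip(bins + ["More"], counts):
    print(label, n)
```

Under this assumption a hand span of exactly 19.0 is counted in the 19 bin (the 18 to 19 category), not in the 20 bin.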
1. In the worksheet where you sorted the hand span data, select a column to the right of the data, type the
word bins in the first cell and the numbers 11 to 24 in the cells under that, one number per cell.
2. Next go to the Data menu and select Data Analysis.
3. If Data Analysis is in the list of tools, skip to step 4. If it is not in the list, you will need to add it.
a. To do this, go to the File menu and select Options then click Add-ins.
b. In the Add-ins window that appears you should see Analysis ToolPak. Click the box to the left of
that, and then click OK. It may take Excel a few seconds to load the tool pack. Now go back to
the Data menu, and you should find Data Analysis listed.
4. When the Data Analysis window opens, scroll down to Histogram and then click OK.
5. In the Input Range box, click the red arrow and select the range of female hand span data, including
the variable name in the first row.
6. In the Bin Range box repeat the procedure to select the list of bins, again including the name bins in
the first row.
7. Now click the Labels box, then click Chart Output, and finally click OK.
8. The histogram may be quite squashed down. If so, click anywhere on it and then move the cursor
over the small black square in the center of the bottom of the histogram box. The cursor should change
to a double arrow. Click and hold the mouse button down while you drag down until the histogram is
large enough to be easy to understand.
It should look like the one below.
Notice that the shape formed by the vertical bars in the histogram is very similar to that in Figure 2.7 in the
text. However, there is a difference between the two histograms. The main feature that distinguishes a
histogram from a bar chart is that, in a histogram, there are no gaps between the vertical bars. Note that
this is the case with Figure 2.7. It is not true, however, of the histogram you have just created using Excel.
The gaps between bars in a bar chart are there to emphasize that the bars represent distinct categories. For
the same reason, the histogram should not have gaps because the bars represent categories that make up one
continuous and uninterrupted range of numbers. It is a flaw in Excel's histogram that it is depicted as a bar
chart. However, here is a histogram of the same data created with an Excel add-in.
[Figure: Histogram of women's right handspans. Vertical axis: Frequency (0 to 30). Horizontal axis: Right handspan (cm), in bins <=11, 11-12, 12-13, and so on up to 22-23, then >23.]
Standard Excel does not include stem-and-leaf plots, dot-plots or box-plots among its data analysis tools.
Thus we cannot use Excel to create plots similar to Figures 2.8 and 2.9 in MOS. As mentioned in chapter 1
of this manual, there are add-ins that extend the statistical capabilities of Excel. Several of these add-ins
include menu options for creating one or more of these plots.
Summary Measures
Section 2.5 in MOS presents several summary measures for quantitative variables. Let's see how to use
Excel to compute those measures. Specifically, we will use Excel to compute the mean, range, and
interquartile range, in addition to the five-number summary encountered earlier. We will use the Songs on
Student iPods data shown in Example 2.11 in MOS. Before you can compute the summaries, you will need
to enter the data into Excel. Go ahead; I'll wait.
1. In cells E5 to E12 type the following eight labels in a column: mean, minimum, 1st quartile, median,
3rd quartile, maximum, range, and interquartile range.
2. To compute the mean, use Excel's Average function. In the cell to the right of the word mean, type
=AVERAGE(a2:a25). You already know how to find each of the numbers in the five-number
summary. So go ahead and do that.
3. The range is simply the largest value minus the smallest value. Thus in the cell to the right of the
word range, you can type =F10-F6 since F10 should be the cell where you determined the maximum
(using the MAX function) and F6 should be the address of the cell where you determined the
minimum (using the MIN function). Finally, the interquartile range is the difference between
the third and first quartiles. So, you can compute this using the quartiles you have already
determined in the same way you just computed the range.
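The three new measures are simple arithmetic, as this Python sketch shows. The function name summary_measures and the five sample values are made up; they are not the Songs data. The cell references in the comments follow from the E5 to E12 label layout described above, and the "inclusive" quartile method is my assumption about what matches Excel's QUARTILE function.

```python
from statistics import mean, quantiles

def summary_measures(data):
    """Return the mean, range, and interquartile range of a list of numbers."""
    # "inclusive" is assumed to match Excel's QUARTILE interpolation
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "mean": mean(data),               # like =AVERAGE(A2:A25)
        "range": max(data) - min(data),   # like =F10-F6
        "iqr": q3 - q1,                   # third quartile minus first quartile
    }

# Made-up values, NOT the Songs data from Example 2.11.
sample = [10, 20, 30, 40, 50]
result = summary_measures(sample)
print(result)
```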
When you are finished, your results should look like this.
If you want to check your formulas in cells F5 through F12, here is what they should be.
Finally, you will compute a variance and a standard deviation. First, let's use the Songs data set listed in
Example 2.11 of MOS.
1. Type these numbers in cells A2 through A25 in an empty Excel worksheet. Remember that, if you
don't have any empty worksheets, you now know how to add one (reminder: look in the Insert menu).
2. Type a label in cell A1 (I used Songs) just to get in the habit of always using a label for a column of
data.
3. Next in a blank cell type the word Variance. In the cell to the right of this type =VAR(A2:A25).
4. To obtain the standard deviation, use the Excel function STDEV in the same way you just used VAR.
That is, type =STDEV(A2:A25). The results are:
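For comparison, here is how the same two measures can be computed in Python. Excel's VAR and STDEV treat the data as a sample and divide by n - 1, which is what Python's variance and stdev functions do; pvariance divides by n instead and corresponds to Excel's VARP. The eight values are made up for illustration and are not the Songs data.

```python
from statistics import variance, stdev, pvariance

# Made-up values, NOT the Songs data from Example 2.11.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_var = variance(sample)   # divides by n - 1, like =VAR(A2:A25)
sample_sd = stdev(sample)       # square root of the sample variance, like =STDEV(A2:A25)
pop_var = pvariance(sample)     # divides by n, like =VARP(A2:A25)
print(sample_var, sample_sd, pop_var)
```

The sample variance is always a little larger than the population variance for the same numbers, because it divides by n - 1 rather than n.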
Let's review what we have accomplished in this chapter. You can create one-way and two-way frequency
tables (using Pivot Table from the Insert menu). You now know how to use Excel to
create a pie chart or a bar chart (using the Pie and Column options in the Insert menu) and a histogram
(using the Histogram command under Data Analysis in the Data menu). You also know how to use Excel to compute a five-number
summary as well as a mean, range, interquartile range, variance, and standard deviation (all using Excel’s
built in functions). Along the way you learned how to unstack and sort data (using the Sort command
under Data). When you have a set of data to analyze, it is a good idea to start by creating graphs and
computing summary measures to “get a feel for” the data. It may not seem like we have covered a lot when
it’s listed in one short paragraph, but you now have a useful array of tools for summarizing a set of data.