Using Excel for Chapters 3 and 4

Microsoft Excel Users Guide
to accompany
Statistics: Unlocking the Power of Data
by Lock, Lock, Lock, Lock, and Lock
Excel Users Guide- 1
Statistics: Unlocking the Power of Data
0.) Getting Started
Microsoft Excel is a simple and very widely used spreadsheet application. It is especially helpful for storing
data, visualizing data, and doing simple algebraic manipulations like addition and multiplication. Excel is not
really designed to do complex statistical analysis, and therefore some of the methods described in the later
chapters of the textbook are not addressed in this guide. The Analysis Toolpak add-in does allow for basic
statistics in Excel, however. For instructions on how to install and load the Toolpak, see Section 0.3 (but this is
not needed for the material in Chapters 1 & 2 of the textbook).
This guide applies to Microsoft Excel 2010 for Windows. Other versions of Excel for Windows and Mac OS
are similar, but be aware that the specific steps described in this guide may not be the same for alternative
versions.
0.1: Entering Data
When you first open Excel, you will see a big empty spreadsheet. We
would like to fill these empty cells with data!
Typically we organize data in Excel just as we structure a data table;
each column represents a variable and each row represents a case or
unit. The variable labels are given in the first row of cells, and the
values for each case are given in the rows below. To enter information,
just click on a cell and start typing. The arrow keys are also useful for
navigating between the cells. You can enter anything you want in
each cell - text for categorical variables, and numbers for quantitative variables.
As an example, let’s enter some of the student survey data from Table
1.1 of the textbook. The first row labels each variable, so cell A1
should be ID, B1 should be Gender, and so on. It helps to make these
bold (by highlighting them and pressing CTRL+b) so that they stand
out. The values for each student are then entered in the cells below.
You can open another empty spreadsheet by clicking on the tabs at the
bottom of the screen (by default these are labeled ‘Sheet1’, ‘Sheet2’,
etc.).
When you save your Excel Workbook as an ‘.xlsx’ file, all of your sheets are saved to a single file.
0.2: Manipulating Data and Basic Operations
In Excel it is very easy to manipulate data; all you need to do is click on a cell and change its value. But be
careful! When entering or manipulating data yourself it is easy to make a mistake, and it is hard to identify
mistakes and fix them later.
Most basic operations for quantitative data are done using
the formula bar fx above the spreadsheet. To specify how to
compute the value of a cell, click on a cell and enter a
formula in the bar above. All formulas should be preceded
by ‘=’. The basic operations are ‘+’ for addition, ‘-’ for
subtraction, ‘*’ for multiplication, ‘/’ for division, and ‘^’ for
an exponent. If we enter ‘= (2*3)^2+4’ and press ↵, the
value of the current cell will be 6^2+4, or 40. We can also
perform operations using other cells. So if we click on A3 and enter ‘=A1/A2’ then press ↵, the value shown in
A3 will be whatever is in A1 divided by whatever is in A2. Then, if we change the value in either A1 or A2, this
changes the value in A3. Of course, if either A1 or A2 does not have a numeric value, this will result in an
error.
It is easy to use a formula to compute a group of cells, like an entire row or column. One way to do this is the
click and drag feature. Let’s return to the student survey example from Section 0.1; we would like to define a new
variable that is the number of hours spent exercising minus
the number of hours spent watching television in a week:
Exercise – TV. Exercise is given in column E and TV is
given in column F, so we’ll let column G be the difference.
After labeling cell G1, we use the formula bar to give the
difference for the first student: ‘=E2-F2’, in cell G2. Then,
click on cell G2 and move your cursor to the bottom right
corner of the cell. Click and hold here, then drag down to
cover cells G3 to G11 and release. This should extend the
formula so that G3=E3-F3, G4=E4-F4, and so on. This click
and drag feature can be much quicker than separately
entering the formula in each cell.
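For readers who want to check the filled-down column outside Excel, the same row-by-row difference can be sketched in Python. The hours below are hypothetical illustration values, not the textbook's survey data, and the variable names are assumptions.

```python
# Hypothetical weekly hours for five students (illustration only).
exercise = [10, 4, 14, 3, 7]   # plays the role of column E
tv = [1, 7, 5, 3, 20]          # plays the role of column F

# Same idea as filling '=E2-F2' down column G: one difference per row.
difference = [e - t for e, t in zip(exercise, tv)]

print(difference)
```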
0.3: Loading the Analysis Toolpak
The Analysis Toolpak for Excel is necessary for some statistical methods. All versions of Excel 2010 should
have this toolpak as an option, but you probably need to load it first. To load it, click File in the top left corner
and then select Options. This will open a new window. Select Add-Ins, then click the Go button near the
bottom of the window. This will open another window. Check the Analysis Toolpak box and click OK. When
you select the Data tab you should now see a Data Analysis icon at the top right of the screen.
1.) Using Excel in Chapter 1
1.1 Random Assignment for Experiments
Excel can be used to randomly assign units to experimental groups. A
simple way to do this is to use the RAND function. Entering ‘=RAND()’ in
the formula bar will generate a random number between 0 and 1 in the
current cell. We can generate a random number for each experimental unit,
and then assign groups based on these random numbers.
For example, perhaps a college professor wants to randomize her class of 16 students into two equally sized
groups. One group will receive Exam A, the other group will receive Exam B (this is similar to Example 1.27
in the textbook). She starts by listing the names of her students in the first column. The names are listed in
alphabetical order. We can randomize the students to the two exam groups in three steps:
1.) Add a second column of random numbers, titled ‘Random’. To do this enter ‘=RAND()’ in the first
row, then click and drag to cover the remaining 15 cells.
2.) Sort the names based on the random numbers. To do this, highlight the 16 student names and then
select Sort under the Data tab above. Select Expand the Selection and click Sort. This allows you to
specify which variable to sort by. Select Random next to Sort by and click OK (see footnote 1). The 16 names now
appear in a completely random order.
3.) We will assign the first 8 students to Exam A and the next 8 to Exam B. Enter a third column titled
Exam and enter A in the first 8 rows and B in the next 8 rows. This gives the random exam assignment
for each student.
Footnote 1: Sorting causes Excel to recalculate, so a new set of random numbers will be generated, and new numbers will be generated any time you change a cell. This is fine, but if you
find it annoying you can change the settings so that the random numbers are generated once and then stay fixed. To do this select
File->Options->Formulas, then choose Manual under Workbook Calculation.
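The three randomization steps can also be sketched in Python as a cross-check on the spreadsheet approach. This is a minimal sketch, not part of the Excel workflow: the function name `assign_exams`, the placeholder student names, and the optional `seed` argument are all assumptions made for illustration.

```python
import random

def assign_exams(names, seed=None):
    """Attach a random number to each name, sort by it, then split in half."""
    rng = random.Random(seed)
    keyed = [(rng.random(), name) for name in names]  # like the '=RAND()' column
    keyed.sort()                                      # like sorting by Random
    half = len(keyed) // 2
    # The first half of the shuffled list gets Exam A, the rest get Exam B.
    return {name: ("A" if i < half else "B")
            for i, (_, name) in enumerate(keyed)}

# Hypothetical class list of 16 students.
students = [f"Student {i}" for i in range(1, 17)]
assignment = assign_exams(students, seed=1)
for name in students:
    print(name, assignment[name])
```

Fixing the seed makes the assignment reproducible, which plays a role similar to switching Excel to manual calculation.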
2.) Using Excel in Chapter 2
2.1 Counts and Proportions for Categorical Variables
The COUNTIF() function can be used to count the number of
times something occurs in a range of cells. This is useful for
calculating counts and proportions for categorical variables. For
example, say we have a list of 16 students, and the award that
each student would prefer to win (Olympic medal, Academy
award, or Nobel prize) is given in cells B2 to B17. If we click
on an empty cell and enter ‘=COUNTIF(B2:B17,"Olympic")’ in
the formula bar, the output will be the number of times
‘Olympic’ occurs in cells B2 to B17. Note that the colon is used
to indicate a range of cells – we can either type this into the
formula bar or use the cursor to highlight the desired range. To
get a proportion, divide the total count by the number of units,
for example: ‘=COUNTIF(B2:B17,"Olympic")/16’.
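The count and proportion can be reproduced in Python as a quick check. The 16 award preferences below are hypothetical, chosen only to illustrate the calculation. One difference to keep in mind: COUNTIF ignores letter case, while the Python comparison here is case-sensitive.

```python
# Hypothetical award preferences for 16 students (illustration only).
awards = [
    "Olympic", "Academy", "Nobel", "Olympic",
    "Olympic", "Nobel", "Academy", "Olympic",
    "Academy", "Olympic", "Nobel", "Academy",
    "Olympic", "Academy", "Nobel", "Olympic",
]

# Equivalent of =COUNTIF(B2:B17,"Olympic")
olympic_count = awards.count("Olympic")

# Equivalent of =COUNTIF(B2:B17,"Olympic")/16
olympic_proportion = olympic_count / len(awards)

print(olympic_count, olympic_proportion)
```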
2.2 Bar Charts and Pie Graphs
Simple graphs can be created by highlighting the data, clicking the Insert tab above and selecting the desired
chart. Bar charts and pie charts require a name and count for each category.
Consider the award preference data in Section 2.1. We have already found the count for each category using
the COUNTIF() function. Now, we simply highlight the three category names with their counts, and select
Insert->Pie Chart -> 2D Pie Chart
above. This will automatically
place a pie chart of the selected data
over the spreadsheet, and this chart
will also be saved when we save as
a ‘.xlsx’ file. A bar chart can be
created similarly by highlighting the
same data and selecting Bar Chart
above. We can modify the look,
labels and features of the graph by
using the Chart Tools toolbar above.
2.3 Histograms
The standard version of Excel does not have the capability to automatically make a histogram from quantitative
data. However, we can make one using the Bar Chart option with just a little bit of work. We need to first
define and label our bins, and then use the COUNTIF() function to count the number of units that fall within
each bin.
As an example, let’s add a third column to the student
survey from Sections 2.1 and 2.2, Exercise (the
number of hours spent exercising per week). Then,
let’s label our bins (of width 5) somewhere else on the
spreadsheet: 0-5, 5-10, 10-15, and 15-20. In this
simple example it’s possible to manually count the
number of units that fall into each bin, but let’s do it
automatically. For example,
= COUNTIF(C2:C17,"<=10") - COUNTIF(C2:C17,"<=5")
counts the number of Exercise values greater than 5 and at most
10 (the ‘<=’ stands for “less than or equal to”).
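The same "cumulative count minus previous cumulative count" idea can be sketched in Python for all four bins at once. The exercise hours are hypothetical, and the helper name `count_at_most` is an assumption; it plays the role of COUNTIF with a "<=" criterion.

```python
# Hypothetical exercise hours for 16 students (illustration only).
exercise = [10, 4, 14, 3, 3, 5, 10, 13, 3, 12, 12, 6, 10, 3, 7, 2]

def count_at_most(values, limit):
    """Number of values less than or equal to limit, like COUNTIF(range,"<=limit")."""
    return sum(1 for v in values if v <= limit)

labels = ["0-5", "5-10", "10-15", "15-20"]
upper_limits = [5, 10, 15, 20]

counts = {}
previous = 0
for label, limit in zip(labels, upper_limits):
    cumulative = count_at_most(exercise, limit)
    counts[label] = cumulative - previous  # values above the last limit, up to this one
    previous = cumulative

print(counts)
```

With this convention a value that sits exactly on a boundary (such as 5 or 10) is counted in the lower bin.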
Now, we can create our histogram. Highlight the bin labels and count data, and select the Column -> 2D
Column chart option above. This generates a histogram of the data (just a column bar chart of the count for
each bin). We can improve the look of the chart by removing the space between bars. Right-click on one of the
bars, select Format Data Series, and move the Gap Width scroller all the way to No Gap. Then, we can label
our vertical axis by selecting Chart Tools -> Layout ->Axis Titles ->Primary Vertical Axis ->Rotated Title
above. We enter the axis label “Frequency” in the box provided next to the axis. We also enter a title,
“Histogram of Exercise Times”, by selecting the Chart Tools -> Layout -> Chart Title -> Above Chart option.
2.4 Mean, Median, Standard Deviation, and Percentiles
The mean, median, and standard deviation of a quantitative
variable can be computed by entering ‘=AVERAGE()’,
‘=MEDIAN()’, and ‘=STDEV()’ in the formula bar. For
example, consider the exercise times from Section 2.3, in
cells C2 to C17. We can compute the mean of these
exercise times by clicking on an empty cell and then
entering ‘=AVERAGE(C2:C17)’ in the formula bar.
Entering ‘=MEDIAN(C2:C17)’ gives the median for the
exercise times and ‘=STDEV(C2:C17)’ gives the standard
deviation.
The ‘=PERCENTILE(range, x)’ function computes the x’th
percentile in a range of values. Here ‘x’ is a number
between 0 and 1, so x=0.95 corresponds to the 95th
percentile. For example we can compute the first quartile
(25th percentile) of the exercise values by entering ‘=PERCENTILE(C2:C17,0.25)’ in the formula bar. The
MIN() and MAX() functions are also useful. MAX(C2:C17) gives the maximum value among the exercise
times, and MIN(C2:C17) gives the minimum.
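These summary statistics can be checked with Python's standard `statistics` module. The data are hypothetical. The sketch assumes that the "inclusive" quantile method corresponds to Excel's PERCENTILE interpolation and that `statistics.stdev` (which divides by n - 1) corresponds to STDEV; treat the correspondence as an assumption to verify on your own data.

```python
import statistics

# Hypothetical exercise hours for 16 students (illustration only).
exercise = [10, 4, 14, 3, 3, 5, 10, 13, 3, 12, 12, 6, 10, 3, 7, 2]

mean = statistics.mean(exercise)      # like =AVERAGE(C2:C17)
median = statistics.median(exercise)  # like =MEDIAN(C2:C17)
sd = statistics.stdev(exercise)       # sample standard deviation, like =STDEV(C2:C17)

# Quartiles with linear interpolation between the minimum and maximum.
q1, q2, q3 = statistics.quantiles(exercise, n=4, method="inclusive")

five_number_summary = (min(exercise), q1, median, q3, max(exercise))
print(mean, median, sd, five_number_summary)
```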
2.5 Boxplots
Excel does not have the capability to generate boxplots automatically. They can be created manually, but this
involves several steps (for more details, see the Microsoft support article at this link:
http://support.microsoft.com/kb/155130). We suggest using other software (such as the StatKey applets:
http://lock5stat.com/statkey/) to create boxplots. However, Excel can be used to find the five-number summary
(as in Section 2.4), and then it is straightforward to create a boxplot by hand.
2.6 Correlation, Scatterplots, and Linear Regression
The function ‘=CORREL(range1,range2)’ can be used to calculate
correlations. For example, let’s add another quantitative variable to
the student survey data from Section 2.3, the number of hours spent
watching TV. We wish to find the correlation between Exercise
(in column C) and TV (in column D). We choose an empty cell
and enter ‘=CORREL(C2:C17,D2:D17)’, which computes the
correlation between the two variables.
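To see what CORREL computes, the correlation can be written out from its definition. This is a sketch with hypothetical Exercise and TV values; the function name `correl` is borrowed from Excel only for readability.

```python
import math

def correl(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical weekly hours (illustration only).
exercise = [10, 4, 14, 3, 7, 12, 6, 2]
tv = [1, 7, 5, 3, 20, 3, 10, 8]

r = correl(exercise, tv)
print(r)
```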
The Scatter option can be used to make a scatterplot of two quantitative variables. To illustrate, we highlight
the data for Exercise and TV and choose Insert -> Scatter ->Scatter with only Markers. We use the Chart
Tools options above to add axis labels and a title.
After creating our scatterplot, we can select the Chart Tools -> Layout -> Trendline -> Linear Trendline option.
This displays the least squares regression line on the chart. To display the equation for this line in the form
y=ax + b, we select Trendline -> More Options and then check the Display Equation on Chart option.
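The trendline equation y=ax + b can be checked by computing the least squares slope and intercept directly. This is a minimal sketch with hypothetical data; the function name `least_squares_line` is an assumption.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept: the line passes through the point of means
    return a, b

# Hypothetical weekly hours (illustration only).
exercise = [10, 4, 14, 3, 7, 12, 6, 2]
tv = [1, 7, 5, 3, 20, 3, 10, 8]

a, b = least_squares_line(exercise, tv)
print(f"y = {a:.3f}x + {b:.3f}")
```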
3.) Using Excel for Chapters 3 and 4
Excel has no built-in capabilities to do the bootstrapping and randomization procedures that are introduced in
Chapters 3 and 4. There are free add-ins available for download online (such as PopTools:
http://www.poptools.org/) that allow for these capabilities. However, these are clunky and not very intuitive.
We suggest using other software (such as the StatKey applets: http://lock5stat.com/statkey/) to perform the
methods described in Chapters 3 and 4.
5.) Using Excel for Theoretical Distributions (Chs 5-10)
5.1 Finding Normal probabilities
The NORMDIST function can be used to calculate the probabilities of a normal distribution in Excel. Entering
‘=NORMDIST(x,mu,sigma,TRUE/FALSE)’ calculates a normal probability where ‘x’ is a value, ‘mu’ and
‘sigma’ are the mean and standard deviation of the normal distribution. For our purposes the final argument of
the function will always be ‘TRUE’, which calculates the area of
everything less than ‘x’ in the normal curve (set this value to ‘FALSE’ to
compute the normal density, which is not relevant for the material in the
textbook). The result will always be a probability between 0 and 1. For
example, entering ‘=NORMDIST(1.96,0,1,TRUE)’ gives approximately
0.975. To find the upper tail probability for ‘x’, use
‘=1 - NORMDIST(x,mu,sigma,TRUE)’. We can find the area between -1.96 and 1.96, for example, by
entering ‘=NORMDIST(1.96,0,1,TRUE) - NORMDIST(-1.96,0,1,TRUE)’, which is about 0.95.
The NORMINV function takes an area under the normal curve and gives a value. Specifically,
‘=NORMINV(p,mu,sigma)’ gives the value ‘x’ such that the area of everything less than ‘x’ on the normal
curve is ‘p’. For example, entering ‘=NORMINV(0.975,0,1)’ gives approximately 1.96.
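The normal calculations above can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), where `cdf` plays the role of NORMDIST with TRUE and `inv_cdf` plays the role of NORMINV. The variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

lower_tail = standard_normal.cdf(1.96)       # like =NORMDIST(1.96,0,1,TRUE)
upper_tail = 1 - standard_normal.cdf(1.96)   # like =1 - NORMDIST(1.96,0,1,TRUE)
middle = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
cutoff = standard_normal.inv_cdf(0.975)      # like =NORMINV(0.975,0,1)

print(round(lower_tail, 3), round(upper_tail, 3), round(middle, 3), round(cutoff, 2))
```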
As an illustration of how we can use normal probabilities in Excel, consider the Gallup poll data in Example 6.7
of the textbook. In this example we test the hypothesis that the proportion of American adults who approve of
the way Congress is handling its job is p=0.20. Under this hypothesis, the sample proportion from the Gallup
poll of n=1013 individuals has distribution
$$\hat{p} \sim N\left(0.20,\ \sqrt{\frac{0.20(1 - 0.20)}{1013}}\right).$$
In Excel, we enter the mean 0.20 in cell C1 and compute the SE in
cell C2: ‘=(0.2*(1-0.2)/1013)^0.5’. The sample proportion 𝑝̂ is
0.19, so we can use the NORMDIST function to calculate a one-sided p-value: ‘=NORMDIST(0.19,C1,C2,TRUE)’.
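The same Gallup calculation can be sketched in Python, using only the numbers given above (null proportion 0.20, n=1013, sample proportion 0.19). The variable names are illustrative.

```python
import math
from statistics import NormalDist

p0 = 0.20      # proportion under the null hypothesis (cell C1)
n = 1013       # sample size
p_hat = 0.19   # observed sample proportion

# Standard error under the null, like '=(0.2*(1-0.2)/1013)^0.5' in cell C2.
se = math.sqrt(p0 * (1 - p0) / n)

# Lower-tail p-value, like '=NORMDIST(0.19,C1,C2,TRUE)'.
p_value = NormalDist(mu=p0, sigma=se).cdf(p_hat)

print(round(se, 4), round(p_value, 3))
```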
5.2 Finding t-Distribution probabilities
The TDIST function is used to calculate probabilities from the t-Distribution with specified degrees of freedom.
Entering ‘=TDIST(x,df,tails)’ calculates the probability using a standard t-Distribution with ‘df’ degrees of freedom, for value ‘x’ (which must not be negative). If ‘tails’ is 1 then a
one-sided probability is given (the area under the curve greater than ‘x’), and if
‘tails’ is 2 then a two-sided probability is given (the area greater than ‘x’ plus the area
less than ‘-x’). For example, ‘=TDIST(1,10,2)’ gives the result of
approximately 0.34, whereas ‘=TDIST(1,10,1)’ returns approximately 0.17.
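To make the tail convention concrete, the following sketch approximates the TDIST values by numerically integrating the t density with Simpson's rule. It is an illustration under stated assumptions, not how Excel computes the result: the function names `t_density` and `tdist`, and the number of integration steps, are choices made here.

```python
import math

def t_density(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def tdist(x, df, tails, steps=2000):
    """Approximate upper-tail (tails=1) or two-tailed (tails=2) area for x >= 0."""
    if x < 0:
        raise ValueError("x must not be negative")
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # Simpson's rule for the area between 0 and x (steps must be even).
    h = x / steps
    total = t_density(0, df) + t_density(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_x = total * h / 3
    upper = 0.5 - area_0_to_x  # by symmetry, half the area lies above 0
    return tails * upper

print(round(tdist(1, 10, 1), 2), round(tdist(1, 10, 2), 2))
```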
5.3 Finding Chi-Square probabilities
The CHIDIST function is used to find Chi-Square probabilities. Entering
‘=CHIDIST(x,df)’ returns the area greater than ‘x’ in a standard Chi-Square distribution with ‘df’ degrees of freedom. For example,
‘=CHIDIST(3,10)’ returns approximately 0.98.
5.4 Finding F-distribution probabilities
The FDIST function is used to find F-distribution probabilities. Entering
‘=FDIST(x,df1,df2)’ returns the area greater than ‘x’ in a standard F-distribution with ‘df1’ numerator degrees of freedom and ‘df2’ denominator
degrees of freedom. For example, ‘=FDIST(2,5,10)’ returns approximately
0.16.
6.) Using Excel for Tests for Means (Chapter 6)
Excel can be used to perform some hypothesis tests that use a theoretical distribution automatically, but the
options are somewhat limited. For example, you can use the Analysis Toolpak to do a test to compare two
means (as described below), but there are no automatic procedures for intervals or tests for proportions. We
give instructions for some of these tests.
Two sample t-tests
The Analysis Toolpak add-in (see Section 0.3) makes it possible to run two sample t-tests quickly. To do a two-sample t-test, select Data->Data Analysis and then t-Test: Two-Sample Assuming Unequal Variances. There
are also similar choices in this window for a two-sample t-Test assuming equal variance, and a t-Test for
matched-pairs data. After choosing a test, a window pops up that allows you to specify the data range for
variable 1 and the data range for variable 2 (these are the two sample sets you are comparing). You can also
specify the hypothesized difference (this will usually be 0) and a significance threshold. Click OK to run the
test. This will open a new sheet (by default) with several statistics for the test, including summary statistics for
each variable, a one-tailed p-value, and a two-tailed p-value.
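The test statistic behind an unequal-variances two-sample t-test can be sketched in Python. The two samples are hypothetical, and the function name `welch_t` is an assumption. The sketch computes only the t-statistic and the approximate (Welch-Satterthwaite) degrees of freedom; it does not compute p-values, and the Toolpak output may report the degrees of freedom rounded.

```python
import math
import statistics

def welch_t(sample1, sample2, hypothesized_difference=0.0):
    """t-statistic and approximate df for two samples with unequal variances."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    v1 = statistics.variance(sample1) / n1  # squared standard error, sample 1
    v2 = statistics.variance(sample2) / n2  # squared standard error, sample 2
    t_stat = (mean1 - mean2 - hypothesized_difference) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df

# Hypothetical samples (illustration only).
variable1 = [1, 2, 3, 4, 5]
variable2 = [2, 4, 6, 8, 10]

t_stat, df = welch_t(variable1, variable2)
print(round(t_stat, 3), round(df, 2))
```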
7.) Using Excel for Chi-square Goodness-of-Fit (Ch. 7)
Chi-Square Test for Goodness-of-Fit
A Chi-Square goodness-of-fit test can be performed automatically using the CHITEST function. Entering
‘=CHITEST(obs_range,exp_range)’ computes a Chi-Square p-value for the observed counts given in cells
‘obs_range’ versus the expected counts given in cells ‘exp_range’. For example, if the observed counts are
given in cells D to G in row 2, and the expected counts are given in cells D to G below in row 3,
‘=CHITEST(D2:G2, D3:G3)’ will compute the goodness-of-fit p-value in one step.
If you have the proportions under the null hypothesis in cells of the spreadsheet, you can use a formula to
compute the expected counts by multiplying those cells by the sample size.
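The goodness-of-fit calculation can be sketched in Python: expected counts from null proportions, the chi-square statistic, and an upper-tail p-value. The observed counts and null proportions are hypothetical. The helper `chi_square_upper_tail` is an assumption made for this sketch; it uses a standard series for the incomplete gamma function and is not how Excel computes CHITEST or CHIDIST.

```python
import math

def chi_square_upper_tail(x, df, terms=200):
    """Area greater than x under a chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half = df / 2, x / 2
    # Series for the regularized lower incomplete gamma function P(a, x/2).
    term = 1 / a
    total = term
    for n in range(1, terms):
        term *= half / (a + n)
        total += term
    lower = total * math.exp(-half + a * math.log(half) - math.lgamma(a))
    return 1 - lower

# Hypothetical observed counts and null proportions (illustration only).
observed = [18, 22, 30, 30]
null_proportions = [0.25, 0.25, 0.25, 0.25]

sample_size = sum(observed)
expected = [p * sample_size for p in null_proportions]  # proportion times sample size

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi_square_upper_tail(statistic, df)

print(round(statistic, 2), df, round(p_value, 3))
```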
Chi-Square Test for Association for Two Categorical Variables
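Excel has no one-step tool for the chi-square test for association for two categorical variables. The CHITEST function does accept a two-way table of observed counts and a matching table of expected counts, but the expected counts must be computed by formula first.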
8.) Using Excel for ANOVA for Means (Ch. 8)
ANOVA for Means
The Analysis Toolpak add-in (see Section 0.3) makes it possible to run a one-way ANOVA for difference in
means analysis quickly. To do an ANOVA test, select Data->Data Analysis and then ANOVA: Single Factor.
A window pops up that allows you to specify a range of data values. The groups for ANOVA are taken to be
the different rows or columns in this range. For example, if there are 3 experimental groups and 6 values for
each group, the data can be organized in a 6 X 3 block of cells. You can also select a significance threshold.
Click OK to run the test. This will open a new spreadsheet (by default) with several statistics for the test,
including the between-group sum of squares, the within-group sum of squares, the F-statistic and p-value.
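The quantities in the ANOVA output can be reproduced from their definitions. This sketch uses three small hypothetical groups and computes the between-group and within-group sums of squares and the F-statistic; the function name `one_way_anova` is an assumption, and the p-value is left out.

```python
def one_way_anova(groups):
    """Return (ss_between, ss_within, f_statistic) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group: how far each value is from its own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_statistic

# Hypothetical data: 3 groups with 3 values each (illustration only).
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_between, ss_within, f_statistic = one_way_anova(groups)
print(ss_between, ss_within, f_statistic)
```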
9.) Regression (Chs 9-10)
9.1 Inference for a single predictor
The Analysis Toolpak add-in (see Section 0.3) makes it possible to automatically run a regression analysis. To
do a regression analysis, select Data->Data Analysis and then Regression. A window pops up that allows you
to specify a range of X (predictor) values and a range of Y (response) values. After we click OK a new
spreadsheet will open (by default) with regression statistics. These include the R^2 value (line 5), the coefficient
and p-value for the intercept (line 17), and the coefficient and p-value for the slope (line 18).
9.2 Inference for multiple predictors
The Regression option in the Analysis Toolpak can also be used to do inference for multiple predictors. We
simply input multiple columns in the X Range field. After we click OK a new spreadsheet will open with
inference statistics for each of the predictors (lines 18, 19, etc.).