Sample size vs. Error
A tutorial
By Bill Thomas, Colby-Sawyer College
Introduction
In the pipetting tutorial, you explored the utility of the mean, the standard
deviation and the relative error in describing the reproducibility and accuracy
of a sample. You also learned a few tricks for working more efficiently within
Excel.
In this tutorial, we are going to explore the relationship between sample size
and variability. How well do our descriptors (mean, st. dev.) tell us whether
our sample is representative of the population we are trying to describe or
evaluate? To answer this, we will introduce a new descriptor, the Standard Error
of the Mean (SEM), and we will see how it varies with the sample size.
Along the way, we will also gain a bit more experience using Excel.
First, what do we mean by the “population” and the “sample”?
Let’s suppose that our population were the numbers from 1 to 10. There would be
10 members of the population, and it would not be difficult to sample (e.g.,
consider, do an experiment on, take data on, or do our calculations on) every
member of the population.
However, what if our population had 10,000 elements (or even more)? It would be
impractical, verging on impossible, to treat every member of the population
separately. Thus we frequently select a sample population to represent the full
population. This approach makes the work easier, but it raises the question of how
representative the sample population is of the full population.
Consider the three images on the following slide.
[Figure: full population = sample population]
The sample population IS the full population; this is ideal.
[Figure: sample population and full population]
The sample population is much smaller than the full population, but the full
population is uniform, so the sample is representative of the full population.
This is rarely the case, unfortunately.
[Figure: sample population and full population]
Here the full population is non-uniform, so a small sample cannot be truly
representative of the full population. This is most often the case we face in
reality, and the question is how to sample appropriately under these
circumstances.
Let’s begin by getting a sense for the kind of variability we might encounter. For
this we will use a new tool in Excel, a random number generator, that will allow
us to create randomly generated populations of numbers of any size and within
any limits. The function looks like this:
=RANDBETWEEN(x,y)
It generates, in the cell in which it is written, a random whole number between
the set limits x and y, with both limits included. Thus, if you specify 2 and 4,
it will produce 2, 3, or 4 in the chosen cell.
Let’s use this function.
Open a new Excel file, and in cell A1 type in the function =RANDBETWEEN(9,11).
When you press Enter, a number between 9 and 11 will appear. Click and drag
this cell down to fill 9 more cells. Each of the 10 cells should now contain a number
between 9 and 11. Highlight the 10 vertical cells and drag them across to
column P. You should have an array of 16 columns by 10 rows, each number
randomly generated, that looks like this:
Note that the number in each position of your set will be different from that shown here.
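If you would like to check your spreadsheet outside Excel, the same kind of grid
can be sketched in Python. This is only an illustration and not part of the Excel
exercise. Here random.randint stands in for RANDBETWEEN (both return whole numbers
and include both limits), and the names make_grid, n_rows, n_cols and the seed
value are assumptions made for this sketch.

```python
import random


def make_grid(low, high, n_rows=10, n_cols=16, seed=None):
    """Return n_cols columns, each holding n_rows random whole numbers.

    random.randint(low, high) includes both limits, like RANDBETWEEN.
    """
    rng = random.Random(seed)  # a fixed seed makes the grid repeatable
    return [[rng.randint(low, high) for _ in range(n_rows)]
            for _ in range(n_cols)]


# 16 columns (A to P) of 10 values each, every value 9, 10, or 11.
grid = make_grid(9, 11, seed=1)
for row_index in range(10):
    # Print row by row so the output resembles the spreadsheet layout.
    print(" ".join(f"{column[row_index]:2d}" for column in grid))
```

With a fixed seed the same numbers come back on every run. Leave the seed as None
to get a fresh set each time, which is closer to how the Excel function behaves.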
Now, there is a little quirk about the function that you just used. Each time the
worksheet recalculates (for example, when you copy and paste a cell containing the
function, or edit any other cell), the number in the cell changes.
While this feature is useful for some purposes, it makes what you are about to do a
bit more challenging.
We need to have numbers that do not change as you work with them, so here is
what to do.
Select all 160 numbers and copy them. Then put the cursor on cell C13. If you have
a PC, right click on the cell and select the option “paste special”. If you have a
Mac, highlight cell C13 and, under the Edit menu on the menu bar, select “paste
special”.
In either case, you will be given a menu whose first section is “paste”. Under
the paste submenu, select “values” and click OK.
The numbers that appear will now be fixed; they will not vary when you manipulate
them.
Next, as preparation for the steps to come, color the cells and add in the
other details shown below.
Remember: A cell with a yellow background color is a cell into which
you type a value, a number (like all the values above). A cell with a
salmon background color is a cell in which you must write an equation
to generate the number shown (which will be the case with most of the
steps to follow).
Now, let’s think for a minute about the numbers that you have generated.
The range allowed was from 9 to 11 (or 10 +/- 1), so you can see that the
average of these numbers ought to be midway between 9 and 11, or 10.
Let’s use Excel to calculate the mean of the first 3 numbers in each vertical
column of 10 numbers to see how close it is to 10. Set your spread sheet
up as shown below, being sure to calculate the mean and (below it) the
standard deviation for the first three numbers in each column.
Now look at your 16 means. How similar are they? How close to the
expected value of 10 are they? Are you surprised?
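The same step can be sketched in Python as a cross-check. This is a rough
illustration under stated assumptions: the grid is regenerated here with
random.randint rather than read from your spreadsheet, the seed and variable
names are made up for the example, and the standard deviation is assumed to be
the sample form (dividing by n - 1), which is what Excel's STDEV function
calculates.

```python
import random
import statistics

rng = random.Random(2)  # assumed seed, only so the run is repeatable
# 16 columns of 10 whole numbers between 9 and 11, as in the spreadsheet.
columns = [[rng.randint(9, 11) for _ in range(10)] for _ in range(16)]

means = []
st_devs = []
for column in columns:
    first_three = column[:3]  # the first 3 numbers in this column
    means.append(statistics.mean(first_three))
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    st_devs.append(statistics.stdev(first_three))

for number, (mean, st_dev) in enumerate(zip(means, st_devs), start=1):
    print(f"column {number:2d}: mean = {mean:.2f}, st. dev. = {st_dev:.2f}")
```

Running it with a few different seeds gives a feel for how much 16 means of
only three values each can wander around 10.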
Let’s visualize the distribution. Create a scatter plot (without a line
connecting the data points) of the means with the axes shown below:
[Figure: Distribution of sample means. Vertical axis: sample mean, 8 to 12.
Horizontal axis: # of samples, 0 to 18.]
There is a number that describes the variation within each data set, and that
number is the standard deviation. Calculate the standard deviation for each of
your sets of 3 as shown below.
Generate a scatter plot of the standard deviations, as well. What does it tell
you about the variation within your data?
The standard deviation expresses the variation within the sample set, but it does not
really tell us how well the sample represents (estimates the values for) the full
population that we are trying to evaluate. For that we need a new term, the standard
error of the mean.
SEM = (standard deviation of the sample) / (square root of the sample size)
Use the standard deviations that you just calculated to calculate the standard error
for each sample of 3.
Use a scatter plot to visualize the distribution of these values as well.
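As a small worked example of the definition above, here is a sketch in Python.
The function name standard_error and the three sample values are made up for
illustration, and the standard deviation is again assumed to be the sample form
(n - 1).

```python
import math
import statistics


def standard_error(values):
    """SEM = sample standard deviation / square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))


sample = [9, 10, 11]  # a made-up sample of three values
print(statistics.stdev(sample))          # 1.0
print(round(standard_error(sample), 3))  # 0.577
```

For this sample the standard deviation is 1, so the SEM is 1 divided by the
square root of 3, or about 0.577.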
It should be clear that there is quite a bit of variation in the data when they are
sampled three at a time. Are samples of this size good at representing the full
population from which they are taken?
What would happen, if we took larger samples from the full population? Begin
by doing the same calculations for each full column of 10 (not just the first
three values in the column).
Add these new data to each of your three scatter plots.
Now take the data twenty values at a time; calculate the mean, st. dev. and SEM
for each pair of adjacent columns. You will have half as many calculated values
as before.
Again, add these new data to each of your three scatter plots.
Complete this exercise in three more successive steps for the data taken 40
values at a time, 80 values at a time, and 160 values at a time, as shown below.
Note that 160 values constitute the entire population.
[Figure: the 16 columns grouped into eight sets of 20 values, four sets of 40,
two sets of 80, and one set of 160.]
Conclude by adding these new data to each of your three scatter plots. Do
your plots show you a pattern? What is that pattern?
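The grouping into larger and larger sample sets can be sketched in Python as
follows. This is an illustration only: the data are regenerated with
random.randint, and the names describe, pooled_samples and sample_sets, as well
as the seed, are assumptions made for this sketch.

```python
import math
import random
import statistics

rng = random.Random(3)  # assumed seed, only so the run is repeatable
columns = [[rng.randint(9, 11) for _ in range(10)] for _ in range(16)]


def describe(values):
    """Return (mean, st. dev., SEM) for one sample set."""
    st_dev = statistics.stdev(values)
    return statistics.mean(values), st_dev, st_dev / math.sqrt(len(values))


def pooled_samples(columns, columns_per_sample):
    """Join neighbouring columns into larger sample sets."""
    samples = []
    for start in range(0, len(columns), columns_per_sample):
        joined = []
        for column in columns[start:start + columns_per_sample]:
            joined.extend(column)
        samples.append(joined)
    return samples


# Sample sets keyed by size: 3 uses the first three values of each column,
# the others pool 1, 2, 4, 8, or all 16 columns of 10 values.
sample_sets = {3: [column[:3] for column in columns]}
for columns_per_sample in (1, 2, 4, 8, 16):
    size = 10 * columns_per_sample
    sample_sets[size] = pooled_samples(columns, columns_per_sample)

for size, samples in sample_sets.items():
    print(f"{size} values per sample, {len(samples)} sample sets")
    for values in samples:
        mean, st_dev, sem = describe(values)
        print(f"  mean = {mean:.3f}  st. dev. = {st_dev:.3f}  SEM = {sem:.3f}")
```

Each doubling of the sample size halves the number of sample sets, ending with
a single set of 160 values.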
In the scatter plots that you have generated thus far, you have illustrated the
distributions of your various calculations, showing each descriptor for each sample
set evaluated. It is possible, and often very helpful, to visualize the trends in the
data with summary graphs. To this end, calculate at the far right of each row, as
shown, the mean of all the values in that row.
Thus the value to the right of the first row will be the mean of all 16 means of the data
taken three values at a time. The second row will be the mean of all the st. devs. of
the data taken three values at a time, and so on down the column to the last row, which
will be just the values calculated for the entire population.
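A sketch of the summary for the SEM, again in Python and again only an
illustration: the 160 values are regenerated with random.randint and stored
column after column in one list, and the names sem, split and summary are
assumptions made for this sketch.

```python
import math
import random
import statistics

rng = random.Random(4)  # assumed seed, only so the run is repeatable
# 160 values: think of them as 16 columns of 10, stored one after another.
values = [rng.randint(9, 11) for _ in range(160)]


def sem(sample):
    """Standard error of the mean for one sample set."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def split(data, size):
    """Cut the data into consecutive sample sets of the given size."""
    return [data[start:start + size] for start in range(0, len(data), size)]


summary = {}  # sample size -> mean of the SEMs of all sets of that size
# Size 3 uses only the first three values of each column of 10.
summary[3] = statistics.mean([sem(column[:3]) for column in split(values, 10)])
for size in (10, 20, 40, 80, 160):
    summary[size] = statistics.mean([sem(s) for s in split(values, size)])

for size, mean_sem in summary.items():
    print(f"{size:3d} values per sample set: mean SEM = {mean_sem:.3f}")
```

The printed pairs are the points for a graph of mean SEM against the number of
data points per sample set.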
Graph the mean standard errors calculated against the sample size as
indicated on the axes below.
[Figure: Mean SEM vs sample size. Vertical axis: mean standard error, 0 to 1.2.
Horizontal axis: # data points per sample set, 0 to 180.]
Do you see a trend? How big does the data set have to be to fulfill the trend?
Let’s explore that a bit.
Let’s create a really big data set. Use the random number generator to create a
column of 1000 values. Then calculate the mean, st. dev., and SEM for sample sets
having 3, 10, 20, 50, 100, 200, 500 and 1000 values (the last being the entire
population). Then make two graphs, one of the mean, the other of the SEM, plotted
against sample size.
Do you remember that we assumed at the beginning that a population distributed
randomly about 10 should have a mean of 10? What do the data show you?
Similarly, how big does the sample have to be before the SEM of the sample
approximates the SEM of the entire population?
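The big data set can be sketched in Python in the same way. Two assumptions are
made here that the exercise itself does not fix: each sample set is taken as
the first 3, 10, 20, ... values of the column, and the limits 9 and 11 are
reused from the earlier steps. The seed and variable names are also made up for
the example.

```python
import math
import random
import statistics

rng = random.Random(5)  # assumed seed, only so the run is repeatable
population = [rng.randint(9, 11) for _ in range(1000)]

results = []  # one (size, mean, st. dev., SEM) entry per sample size
for size in (3, 10, 20, 50, 100, 200, 500, 1000):
    sample = population[:size]  # assumption: the first `size` values
    st_dev = statistics.stdev(sample)
    results.append((size, statistics.mean(sample), st_dev,
                    st_dev / math.sqrt(size)))

print(" size    mean  st.dev.     SEM")
for size, mean, st_dev, sem in results:
    print(f"{size:5d} {mean:7.3f} {st_dev:8.3f} {sem:7.3f}")
```

The mean and SEM columns of the printout correspond to the two graphs requested
above.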
Some things to try:
Now that you have the calculator and graphs all set up, you can “play” pretty easily.
What happens to your values and graphs from the previous slide if you change the
spread of the data? For example, what if you used RANDBETWEEN(7,13)? How
about RANDBETWEEN(5,15)?
Does changing the size of the data set (2000, 5000, 10,000 values) have any
interesting effect on these outcomes? (Don’t be intimidated. The numbers look
huge, but you can create such sets in seconds.) How big does each population have
to be for the mean to resolve to the expected value?
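If you want to try these variations quickly, a loop like the following sketch
runs through each spread and each data set size. It is an illustration only:
random.randint again stands in for RANDBETWEEN, and the seed and the results
dictionary are assumptions made for the example.

```python
import math
import random
import statistics

rng = random.Random(6)  # assumed seed, only so the run is repeatable

results = {}  # (low, high, size) -> (mean, st. dev., SEM)
for low, high in ((9, 11), (7, 13), (5, 15)):
    print(f"RANDBETWEEN({low},{high})")
    for size in (1000, 2000, 5000, 10000):
        values = [rng.randint(low, high) for _ in range(size)]
        mean = statistics.mean(values)
        st_dev = statistics.stdev(values)
        sem = st_dev / math.sqrt(size)
        results[(low, high, size)] = (mean, st_dev, sem)
        print(f"  {size:6d} values: mean = {mean:.3f}, "
              f"st. dev. = {st_dev:.3f}, SEM = {sem:.4f}")
```

Compare how the st. dev. and the SEM respond to a wider spread, and how each
responds to a larger data set.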
Finally, what have you learned about sampling a population with inherent
variability?