Sample size vs. Error
A tutorial
By Bill Thomas, Colby-Sawyer College

Introduction

In the pipetting tutorial, you explored the utility of the mean, the standard deviation and the relative error in describing the reproducibility and accuracy of a sample. You also learned a few tricks for working more efficiently within Excel.

In this tutorial, we are going to explore the relationship between sample size and variability. How well do our descriptors (mean, st. dev.) tell us whether our sample is representative of the population we are trying to describe or evaluate? For this we will introduce a new descriptor, the Standard Error of the Mean (SEM), and we will see how it varies with the sample size. Along the way, we will also gain a bit more experience using Excel.

First, what do we mean by the “population” and the “sample”? Let’s suppose that our population were the numbers from 1 to 10. There would be 10 members of the population, and it would not be difficult to sample (e.g., consider, do an experiment on, take data on, or do our calculations on) every member of the population. However, what if our population had 10,000 elements (or even more)? It would be impractical, verging on impossible, to treat every member of the population separately. Thus we frequently select a sample population to represent the full population. This approach makes the work easier, but it raises the question of how representative the sample population is of the full population. Consider the images on the following slide.

[Image label: Full population = Sample population] The sample population IS the full population; this is ideal.

[Image labels: Sample population, Full population] The sample population is much smaller than the full population, but the full population is uniform, so the sample is representative of the full population. This is rarely the case, unfortunately.
[Image labels: Sample population, Full population] Here the full population is non-uniform, so a small sample cannot be truly representative of the full population. This is most often the case we face in reality, and the question is how to sample appropriately under these circumstances.

Let’s begin by getting a sense for the kind of variability we might encounter. For this we will use a new tool in Excel, a random number generator, that will allow us to create randomly generated populations of numbers of any size and within any limits. The function looks like this:

=RANDBETWEEN(x,y)

It will generate, in the cell in which it is written, a random whole number between the set limits x and y, with the limits themselves included. Thus, if you specify 2 and 4, it will produce 2, 3 or 4 in the chosen cell.

Let’s use this function. Open a new Excel file, and in cell A1 type in the function =RANDBETWEEN(9,11). When you press “Enter”, a whole number between 9 and 11 will appear. Click and drag this cell down to fill 9 more cells. Each of the 10 cells should now contain a number between 9 and 11. Highlight the 10 vertical cells and click and drag them over to column P. You should have an array of 16 columns by 10 rows, each number randomly generated, that looks like this:

[Image: example 16 x 10 array of random numbers]

Note that the number in each position of your set will be different from that shown here.

Now, there is a little quirk about the function that you just used. Each time the worksheet recalculates (for example, when you copy and paste a cell containing the function), the number in the cell changes. While this feature is useful for some purposes, it makes what you are about to do a bit more challenging. We need to have numbers that do not change as you work with them, so here is what to do. Select all 160 numbers and copy them. Then put the cursor on cell C13. If you have a PC, right click on the cell and select the option “paste special”. If you have a Mac, highlight cell C13 and under the edit option on the menu bar, select “paste special”.
In each case, you will be given a menu, the first choice on which is “paste”. Under the paste submenu, select “values” and click OK. The numbers that appear will now be fixed; they will not vary when you manipulate them.

Next, as preparation for the steps to come, color the cells and add in the other details shown below. Remember:

A cell with a yellow background color is a cell into which you type a value, a number (like all the values above).

A cell with a salmon background color is a cell in which you must write an equation to generate the number shown (which will be the case with most of the steps to follow).

Now, let’s think for a minute about the numbers that you have generated. The range allowed was from 9 to 11 (or 10 +/- 1), so you can see that the average of these numbers ought to be midway between 9 and 11, or 10. Let’s use Excel to calculate the mean of the first 3 numbers in each vertical column of 10 numbers to see how close it is to 10. Set your spreadsheet up as shown below, being sure to calculate the mean and (below it) the standard deviation for the first three numbers in each column.

Now look at your 16 means. How similar are they? How close to the expected value of 10 are they? Are you surprised? Let’s visualize the distribution. Create a scatter plot (without a line connecting the data points) of the means with the axes shown below:

[Plot: “Distribution of sample means”; y-axis: sample mean, 8 to 12; x-axis: # of samples, 0 to 18]

There is a number that describes the variation within each data set, and that number is the standard deviation. Calculate the standard deviation for each of your sets of 3 as shown below. Generate a scatter plot of the standard deviations, as well. What does it tell you about the variation within your data?

The standard deviation expresses the variation within the sample set, but it does not really tell us how well the sample represents (estimates the values for) the full population that we are trying to evaluate.
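For readers who want to check their spreadsheet against an independent calculation, the steps so far can be sketched outside Excel. This is only a rough Python equivalent under stated assumptions: random.randint stands in for RANDBETWEEN, the seed is a hypothetical value chosen so the run is repeatable, and statistics.stdev (the sample standard deviation, dividing by n - 1) is assumed to match the standard deviation function used in your spreadsheet, which the tutorial does not name.

```python
import random
import statistics

random.seed(1)  # hypothetical seed, only so that the run is repeatable

N_COLUMNS, N_ROWS = 16, 10

# Each column holds 10 whole numbers from 9 to 11 (limits included),
# standing in for a column filled with =RANDBETWEEN(9,11).
columns = [[random.randint(9, 11) for _ in range(N_ROWS)]
           for _ in range(N_COLUMNS)]

# Mean and sample standard deviation (n - 1 in the denominator; an
# assumption about the spreadsheet) of the first three values per column.
means = [statistics.mean(col[:3]) for col in columns]
st_devs = [statistics.stdev(col[:3]) for col in columns]

for i, (m, s) in enumerate(zip(means, st_devs), start=1):
    print(f"column {i:2d}: mean = {m:.2f}, st. dev. = {s:.2f}")
```

Your own numbers will differ, because the values are random. As in the spreadsheet, each standard deviation describes only the spread within one sample of three; it does not say how well that sample estimates the full population.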
For that we need a new term, the standard error of the mean.

S.E.M. = (standard deviation of the sample) / (square root of the sample size)

Use the standard deviations that you just calculated to calculate the standard error for each sample of 3. Use a scatter plot to visualize the distribution of these values as well.

It should be clear that there is quite a bit of variation in the data when they are sampled three at a time. Are samples of this size good at representing the full population from which they are taken? What would happen if we took larger samples from the full population?

Begin by doing the same calculations for each full column of 10 (not just the first three values in the column). Add these new data to each of your three scatter plots.

Now take the data twenty values at a time; calculate the mean, st. dev. and SEM for each set of two columns of data. You will have half as many calculated values as before. Again, add these new data to each of your three scatter plots.

Complete this exercise in three more successive steps for the data taken 40 values at a time, 80 values at a time, and 160 values at a time, as shown below. Note that 160 values constitute the entire population.

[Diagram: the 16 columns grouped into eight sets of 20 values, four sets of 40, two sets of 80, and one set of 160]

Conclude by adding these new data to each of your three scatter plots. Do your plots show you a pattern? What is that pattern?

In the scatter plots that you have generated thus far, you have illustrated the distributions of your various calculations, showing each descriptor for each population evaluated. It is possible, and often very helpful, to visualize the trends in the data with summary graphs. To this end, calculate to the far right, as shown, the means for all the values grouped horizontally across each line. Thus the value to the right of the first line will be the mean of all 16 means of the data taken three values at a time. The second line will be the mean of all the st. devs.
of the data taken three values at a time, and so on down the column to the last value, which will be just the values calculated for the entire population.

Graph the mean standard errors calculated against the sample size as indicated on the axes below.

[Plot: “Mean SEM vs sample size”; y-axis: Mean standard error, 0 to 1.2; x-axis: # data points per sample set, 0 to 180]

Do you see a trend? How big does the data set have to be to fulfill the trend? Let’s explore that a bit.

Let’s create a really big data set. Use the random number generator to create a column of 1000 values. Then calculate the mean, st. dev., and SEM for the sample sets having 3, 10, 20, 50, 100, 200, 500 and 1000 values (the last being the entire population). Then make two graphs, one of the mean, the other of the SEM, plotted against sample size.

Do you remember that we assumed at the beginning that a population distributed randomly about 10 should have a mean of 10? What do the data show you? Similarly, how big does the sample have to be before the SEM of the sample approximates the SEM of the entire population?

Some things to try:

Now that you have the calculator and graphs all set up, you can “play” pretty easily. What happens to your values and graphs from the previous slide if you change the spread of the data? For example, what if you used RANDBETWEEN(7,13)? How about RANDBETWEEN(5,15)?

Does changing the size of the data set (2000, 5000, 10,000 values) have any interesting effect on these outcomes? (Don’t be intimidated. The numbers look huge, but you can create such sets in seconds.) How big does each population have to be for the mean to resolve to the expected value?

Finally, what have you learned about sampling a population with inherent variability?
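If you would like to cross-check the large data set exercise outside Excel, the calculation can be sketched in Python. This is an illustration under stated assumptions, not the tutorial's own method: random.randint stands in for RANDBETWEEN, the seed is a hypothetical value, each sample set is assumed to be the first n values of the column, and statistics.stdev (sample standard deviation, n - 1 in the denominator) is assumed to match the spreadsheet. The helper name describe is made up for this example.

```python
import math
import random
import statistics

random.seed(2)  # hypothetical seed, only so that the run is repeatable

# A column of 1000 whole numbers from 9 to 11, like =RANDBETWEEN(9,11).
population = [random.randint(9, 11) for _ in range(1000)]


def describe(values):
    """Return (mean, sample st. dev., SEM) for a list of numbers."""
    st_dev = statistics.stdev(values)
    sem = st_dev / math.sqrt(len(values))  # SEM = st. dev. / sqrt(sample size)
    return statistics.mean(values), st_dev, sem


# Assumption: each sample set is the first n values of the column.
sample_sizes = [3, 10, 20, 50, 100, 200, 500, 1000]
results = {n: describe(population[:n]) for n in sample_sizes}

for n, (mean, st_dev, sem) in results.items():
    print(f"n = {n:4d}: mean = {mean:6.3f}, "
          f"st. dev. = {st_dev:.3f}, SEM = {sem:.3f}")
```

To try the variations suggested above, change the limits passed to random.randint (for example 7 and 13) or the number of values generated, and compare the printed table with your graphs.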