Descriptive statistics

One of my pleasures is growing roses. In my dreams, I have a garden full of 200 rose plants. I'd like to maximize the number of flowers on each of my rose plants, so I'll plan some experiments. Before I begin my experiments, I'll collect some information on my 200 roses: what is the average number of flowers on each rose plant, and how variable is the number of flowers?

Being enthusiastic, I measure the number of flowers on each of my 200 roses. The graph on the left shows the result. Each circle represents one rose plant. The graph on the right shows a histogram of the same data.

Population versus sample

The 200 roses in my garden are the entire population that I'm interested in working on. Another population I might be interested in could be the set of all pills produced in a particular factory during a particular year. For both these examples, it is relatively easy to count and measure every member of the population.

It is unusual for the entire population we are interested in to be small and easily defined. More commonly, we could be interested in the population consisting of all people who might be given a drug we are developing. It would be difficult to count and measure the effect of our drug in people who may not even be born yet. Or we might be interested in the population of all members of a species of beetle. They would be quite difficult to count and measure.

Since it is often difficult or impossible to count and measure every member of a population we are interested in, we usually take a sample from that population. We then try to make inferences about the population based on our sample. When I do my rose experiments, I usually won't want to do the work of measuring all 200 roses again. Instead, I'll likely take a sample of, say, 10 or 20 roses. When I do a clinical trial, I can't measure all the patients who might take the drug, so I might take a sample of, say, 300 patients.
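As a small illustration of taking a sample from a population, here is a minimal Python sketch that picks 10 of the 200 rose plants at random. The plant ID numbers (1 to 200), the variable names, and the seed are assumptions made for the example; they are not part of the rose data.

```python
import random

# Hypothetical labels: number the 200 rose plants 1..200.
population_ids = list(range(1, 201))

# A fixed seed (arbitrary choice) makes the sketch repeatable.
rng = random.Random(42)

# Draw 10 distinct plants; every plant is equally likely to be chosen.
sample_ids = rng.sample(population_ids, 10)

print(sorted(sample_ids))
```

Each run with a different seed gives a different set of 10 plants, which is why a sample mean is rarely exactly equal to the population mean.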
A typical value: the population mean

Because I measured the number of flowers on all 200 roses in my garden, I can calculate the mean of all 200. The mean of all 200 roses in my population is called the population mean. As we'll see shortly, the mean of a sample of, say, 10 roses from my garden (the sample mean) will be close to, but rarely exactly the same as, the mean of all 200 (the population mean).

To calculate the mean (average) of N numbers, add up all the N numbers and divide by N. The Greek letter $\mu$ (mu) is the population mean.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

The symbol $\Sigma$ in the formula is the Greek capital letter Sigma. $\sum X_i$ indicates adding up all N of the $X_i$ observations. If you are not familiar with using $\Sigma$ notation, see the end of this chapter for an explanation.

Suppose my population is 5 small apple trees growing in my garden. I count the number of ripe apples on each tree.

Tree number             1    2    3    4    5
Number of ripe apples   8    8   10   12   12

The sum of all 5 numbers is 50, so the mean number of ripe apples is 50/5 = 10 ripe apples.

$$\mu = \frac{\sum X_i}{N} = \frac{50}{5} = 10$$

The mean has two interesting properties.

1. The sum of the signed distances of the observations from the mean (each observation minus the mean) is always zero.
2. The mean minimizes the sum of the squared distances of the observations from it. If you calculate the sum of the squared distances of all the observations from any value other than the mean, that sum will always be greater than the sum of the squared distances from the mean.

In statistics, we often describe the distance of an observation from the mean as the deviation from the mean.

For my 200 roses, the number of flowers on each rose is in the Excel file "flowercounts.xlsx" on the book website. In that file, you see that the population mean is $\mu$ = 19.65 flowers per rose.

Population variance and standard deviation

I'm also interested in the variability of the number of ripe apples on my trees, and will be interested in the variability of the number of flowers on my roses.
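Before turning to variability, the two properties of the mean stated above can be checked numerically on the apple-tree data. This is only a sketch: the helper name `sum_sq` and the list of candidate values are assumptions chosen for illustration.

```python
apples = [8, 8, 10, 12, 12]
mean = sum(apples) / len(apples)  # 50 / 5 = 10.0

# Property 1: the deviations from the mean add up to zero.
deviations = [x - mean for x in apples]
sum_dev = sum(deviations)
print("sum of deviations:", sum_dev)

# Property 2: the sum of squared distances is smallest at the mean.
def sum_sq(center):
    """Sum of squared distances of the observations from `center`."""
    return sum((x - center) ** 2 for x in apples)

# Hypothetical candidate values on either side of the mean.
candidates = [8, 9, 9.5, 10, 10.5, 11, 12]
for c in candidates:
    print(c, sum_sq(c))
```

The sum of squared distances is 16 at the mean of 10 and larger for every other candidate (for example, 21 at both 9 and 11). The same sum of squared deviations is the starting point for the variance.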
A common goal of designed experiments is to reduce variability. For example, we might want to

- reduce the variability of the yield of a process
- reduce the variability in the amount of drug in a pill
- reduce the variability of replicate measures in an assay

We could describe the variability of the number of flowers by giving the highest and the lowest values (the range of values). But the range is not a very good descriptor of variability, because it can be greatly affected by a single unusual point. A single incorrect measurement or an outlier could give very extreme values.

The most widely used descriptors of variability are the variance and the standard deviation.

population variance = $\sigma^2$ (sigma squared)
population standard deviation = $\sigma$ (sigma)

We calculate a population variance to describe the variability of observations around the population mean.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

The population standard deviation is just the square root of the population variance.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

Tree number             1    2    3    4    5
Number of ripe apples   8    8   10   12   12

So for our ripe apples example, we have

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{(8-10)^2 + (8-10)^2 + (10-10)^2 + (12-10)^2 + (12-10)^2}{5} = \frac{4 + 4 + 0 + 4 + 4}{5} = \frac{16}{5} = 3.2 \text{ ripe apples}^2$$

The population standard deviation is the square root of the variance.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} = \sqrt{3.2} = 1.79 \text{ ripe apples}$$

For my 200 roses, the Excel file "flowercounts.xlsx" shows that the population standard deviation is $\sigma$ = 5.02 flowers per rose.

Sample mean, variance and standard deviation

When we evaluate a new drug we might wish that we could collect observations on every member of the population, and calculate the population variance and the population standard deviation. But that is usually not possible.
In clinical trials and experiments, we take a random sample of N observations from a (much larger, possibly infinite) population. The sample mean, sample variance and sample standard deviation are estimates of the true population mean, population variance and population standard deviation.

We will use different symbols when we calculate these descriptive statistics using the N observations in a sample:

sample mean = $\bar{X}$
sample variance = $s^2$
sample standard deviation = $s$

We use an X with a bar over the top, $\bar{X}$, as the symbol for the sample mean.

$$s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N-1}$$

The sample standard deviation, s, is the square root of the sample variance $s^2$.

$$s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N-1}}$$

Notice that, for the population standard deviation, we divided by N. For the sample standard deviation, we divide by N-1. The observations in a sample tend to lie closer to their own sample mean than to the true population mean, so dividing by N would tend to underestimate the variability. Dividing by N-1 gives us a more accurate estimate of the true population standard deviation.

Here is an example of calculating the sample mean and standard deviation. In this case, we calculate the mean number of flowers on five plants. These N = 5 plants are a sample of all the plants that we could measure by repeating the experiment.

Plant number        1    2    3    4    5
Number of flowers   8    8   10   12   12

To calculate the mean of N numbers, add up all the N numbers and divide by N. For the five plants N is 5. The sum of all 5 numbers is 50, so the mean number of flowers is 50/5 = 10 flowers:

$$\bar{X} = \frac{8 + 8 + 10 + 12 + 12}{5} = \frac{50}{5} = 10 \text{ flowers}$$

We calculate the sample variance to describe the variability of observations around the sample mean.

$$s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N-1} = \frac{(8-10)^2 + (8-10)^2 + (10-10)^2 + (12-10)^2 + (12-10)^2}{5-1} = \frac{4 + 4 + 0 + 4 + 4}{4} = \frac{16}{4} = 4 \text{ flowers}^2$$

The variance has units of flowers², flowers squared. We'd like to have a measure of variability in flowers, the same units as the original measurements.
The sample standard deviation, s, has the same units as the original measurements.

$$s = \sqrt{s^2} = \sqrt{4 \text{ flowers}^2} = 2 \text{ flowers} \quad (\text{since } 2^2 = 4)$$

In our sample of 5 plants, the sample standard deviation of the number of flowers on each plant is s = 2 flowers. The sample variance is $s^2$ = 4 flowers².

Standard Error of the Mean (SEM): An intuitive explanation

Often, we want to test if there is a difference between the means of two groups. For example, I'd like to test if the mean number of flowers per rose is different when I do or do not use fertilizer. To test for differences in the means, I need an accurate estimate of the true population mean for each treatment group.

We want to quantify how accurately the sample mean estimates the population mean. To quantify the error in estimating the population mean we use the standard error of the mean, which we'll define next. The symbol for the standard error of the mean (SEM) is $\sigma_{\bar{X}}$.

To illustrate the concept of the standard error, let's start by taking some random samples from our population of 200 roses and seeing how close the mean of each sample is to the population mean $\mu$ = 19.65 flowers per rose.

The graph shows the 10 roses selected in our first random sample, Sample 1. The black boxes in the graph show the 10 roses selected. The numbers of flowers on the 10 roses in Sample 1 are 14, 15, 18, 19, 21, 21, 23, 27, 28, and 28. The mean number of flowers on the 10 roses is 21.4.

We'll take two more samples of 10 roses each to give us Sample 2 and Sample 3. Here are the numbers of flowers on the roses in each sample, and the sample means.

Sample   N    Number of flowers on each rose            Sample mean
1        10   14, 15, 18, 19, 21, 21, 23, 27, 28, 28    21.4
2        10   6, 13, 19, 20, 20, 20, 21, 23, 27, 27     19.6
3        10   7, 8, 15, 16, 19, 20, 21, 21, 21, 26      17.4

The means of these three samples are distributed around the population mean of 19.65.
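The hand calculations for the five-value example, and the three sample means in the table above, can be reproduced with Python's standard `statistics` module. This is a minimal sketch. The variable names are assumptions, and treating the same five numbers first as a population and then as a sample is done only to show the effect of dividing by N versus N-1.

```python
import statistics

values = [8, 8, 10, 12, 12]

# Treated as a complete population: divide by N.
pop_var = statistics.pvariance(values)   # 16 / 5 = 3.2
pop_sd = statistics.pstdev(values)       # sqrt(3.2), about 1.79

# Treated as a sample: divide by N - 1.
samp_var = statistics.variance(values)   # 16 / 4 = 4
samp_sd = statistics.stdev(values)       # sqrt(4) = 2

print(pop_var, round(pop_sd, 2), samp_var, samp_sd)

# The three samples of 10 roses from the table.
samples = {
    1: [14, 15, 18, 19, 21, 21, 23, 27, 28, 28],
    2: [6, 13, 19, 20, 20, 20, 21, 23, 27, 27],
    3: [7, 8, 15, 16, 19, 20, 21, 21, 21, 26],
}
sample_means = {k: statistics.mean(v) for k, v in samples.items()}
print(sample_means)
```

The sample means come out as 21.4, 19.6 and 17.4, matching the table.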
If, instead of 3 samples, we take 100 samples, what would the distribution of sample means look like? Here are the means of 100 random samples, each of size N=10, from the 200 roses. The means are sorted; read down each column. The number in brackets is the position of the first value in that column.

[1]    [16]   [31]   [46]   [61]   [76]   [91]
15.9   18.3   18.9   19.5   20.2   20.8   21.9
16.1   18.4   18.9   19.5   20.2   20.8   21.9
16.2   18.4   19.0   19.6   20.2   20.9   21.9
16.6   18.4   19.0   19.6   20.3   20.9   22.2
16.7   18.5   19.0   19.6   20.3   21.0   22.2
16.9   18.6   19.1   19.7   20.3   21.1   22.3
17.1   18.6   19.1   19.8   20.3   21.3   23.1
17.2   18.6   19.2   19.8   20.4   21.3   23.2
17.6   18.7   19.2   19.8   20.4   21.3   23.3
17.6   18.7   19.2   19.9   20.4   21.4   24.6
17.6   18.7   19.2   19.9   20.6   21.4
17.7   18.7   19.2   19.9   20.6   21.5
17.8   18.8   19.3   19.9   20.7   21.6
17.9   18.9   19.3   19.9   20.7   21.7
18.0   18.9   19.4   20.1   20.8   21.7

Here is a histogram showing the distribution of the sample means, and a histogram of the population of 200 roses. Notice that the distribution of the sample means is centered near 19. The mean of the 100 sample means is 19.7, close to the true population mean of 19.65. The standard deviation of the 100 sample means is 1.65.

What would the distribution of sample means look like if we had a population with a smaller standard deviation? Let's look at a population with a smaller standard deviation than the rose population: a population of 160 irises. Here are the numbers of flowers on each of the 160 irises, and a histogram showing the distribution. As before, the values are sorted; read down each column.

[1]   [28]  [55]  [82]  [109] [136]
14    18    18    20    21    22
15    18    19    20    21    22
15    18    19    20    21    22
16    18    19    20    21    22
16    18    19    20    21    22
16    18    19    20    21    22
16    18    19    20    21    22
17    18    19    20    21    22
17    18    19    20    21    22
17    18    19    20    21    22
17    18    19    20    21    22
17    18    19    20    21    22
17    18    19    20    21    22
17    18    19    20    21    22
17    18    19    20    21    22
17    18    19    20    21    23
17    18    19    20    21    23
17    18    19    20    21    23
17    18    19    20    21    23
17    18    19    20    21    23
17    18    19    20    22    23
18    18    19    20    22    24
18    18    19    20    22    24
18    18    20    20    22    25
18    18    20    20    22    25
18    18    20    20    22
18    18    20    20    22

The histogram on the left shows the distribution of the number of flowers on the irises.
The histogram on the right shows the distribution of the number of flowers on the roses. Notice that the distribution of the number of flowers on the iris plants is narrower than the distribution for the roses. Here are the population mean and standard deviation for the roses and irises.

Plant   Population mean   Population standard deviation
Rose    19.65             5.02
Iris    19.60             2.03

Now, let's take 100 random samples, each of size N=10, from the 160 irises, and calculate the mean for each sample.

Here is a histogram showing the distribution of the iris sample means, and a histogram of the population of 160 irises. From the histogram you can see that the distribution of the sample means for the irises is centered near 19. The mean of the 100 sample means is 19.5, close to the true population mean of 19.6. The standard deviation of the 100 sample means is 0.65.

Here are graphs to summarize the rose and iris example. Here are the population mean, the population standard deviation, and the standard deviation of the sample means for the roses and irises.

Plant   Sample size   Population mean   Population standard deviation   Standard deviation of sample means
Rose    10            19.65             5.02                            1.65
Iris    10            19.60             2.03                            0.65

The graphs show these results. The roses have a larger population standard deviation, and a larger standard deviation of the sample means. The irises have a smaller population standard deviation, and a smaller standard deviation of the sample means.
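The iris experiment can be repeated with a short simulation. The sketch below rebuilds the iris population from a tally of the 160 flower counts listed earlier, then draws 100 random samples of 10 plants each. The seed, the function name `sample_means`, and the choice to sample without replacement are assumptions, so the results will be close to, but not identical to, the values quoted above.

```python
import random
import statistics

# Tally of the 160 iris flower counts listed earlier:
# {number of flowers: number of plants with that count}.
iris_counts = {14: 1, 15: 2, 16: 4, 17: 14, 18: 34, 19: 22,
               20: 31, 21: 20, 22: 22, 23: 6, 24: 2, 25: 2}
irises = [value for value, n in iris_counts.items() for _ in range(n)]

rng = random.Random(1)  # arbitrary seed (an assumption) for repeatability

def sample_means(population, sample_size, n_samples):
    """Mean of each of n_samples random samples, drawn without replacement."""
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(n_samples)]

means = sample_means(irises, sample_size=10, n_samples=100)

print("population mean:", statistics.mean(irises))
print("population SD:", round(statistics.pstdev(irises), 2))
print("mean of the 100 sample means:", round(statistics.mean(means), 2))
print("SD of the 100 sample means:", round(statistics.stdev(means), 2))
```

Changing the seed gives a different set of 100 samples, but the standard deviation of the sample means should stay well below the population standard deviation of about 2.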
When we use a sample to estimate the true population mean, the error in our estimate is affected by the population standard deviation. This example is an instance of a general law: the standard error of the mean, $\sigma_{\bar{X}}$, is proportional to the population standard deviation, $\sigma$.

If the population of number of flowers has a very small standard deviation, then

- the samples from that population will have small sample standard deviations,
- the sample means will be close to the population mean, and
- we will have a small standard error of the mean.

If the population of number of flowers has a large standard deviation, then

- the samples from that population will have large sample standard deviations,
- the sample means may be far from the population mean, and
- we will have a large standard error of the mean.

Large population variability causes a large standard error of the mean.

Now let's look at how the number of observations in a sample (sample size N) affects how accurately the sample mean estimates the population mean.

If the number of observations in our sample is only N = 2, we're not very confident that the sample mean will be close to the population mean. On the other hand, if we have N = 100 or N = 1000, we start to be a lot more confident that the mean of any given sample will be close to the population mean. The estimate of the population mean using 2 observations is less reliable than the estimate using 20 observations and much less reliable than the estimate of the mean using 100 observations. As the sample size N gets bigger, we expect our error in estimating the population mean to get smaller.

Here's an example using the roses, with sample sizes of N = 4, 9, or 25.

Sample size N                               4      9      25
Standard deviation of 100 sample means      2.44   1.61   0.88

This example is an instance of the general case: the standard error of the mean, $\sigma_{\bar{X}}$, is proportional to $1/\sqrt{N}$.
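A quick check on the table above: if the standard deviation of the sample means shrinks in proportion to one over the square root of N, then multiplying each value by the square root of N should give roughly the same number every time. The sketch below does that arithmetic; the variable names are assumptions.

```python
import math

# Standard deviation of 100 sample means, from the table above.
sd_of_means = {4: 2.44, 9: 1.61, 25: 0.88}

# Rescale by sqrt(N), and by N for comparison.
scaled_by_sqrt_n = {n: sd * math.sqrt(n) for n, sd in sd_of_means.items()}
scaled_by_n = {n: sd * n for n, sd in sd_of_means.items()}

for n in sd_of_means:
    print(n, round(scaled_by_sqrt_n[n], 2), round(scaled_by_n[n], 2))
```

Scaling by the square root of N gives 4.88, 4.83 and 4.4, which are roughly constant and not far from the rose population standard deviation of 5.02. Scaling by N gives 9.76, 14.49 and 22.0, which are clearly not constant.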
Unfortunately, the standard error only goes down in proportion to the square root of N, rather than in proportion to N itself. To halve the standard error, we need four times as many observations.

These examples show that $\sigma_{\bar{X}}$ depends on both the population standard deviation, $\sigma$, and the number of observations in our sample, N. The actual relationship is

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}$$

That is, the standard error of the mean is directly proportional to the population standard deviation, and inversely proportional to the square root of N, the number of observations in each sample.

Usually we don't know the population standard deviation, $\sigma$, so instead we approximate $\sigma$ by the sample standard deviation, s. Using s in place of $\sigma$ gives us this formula to estimate the standard error of the mean:

$$\text{Standard Error of the Mean} = s_{\bar{X}} = \frac{s}{\sqrt{N}}$$

We'll use standard error in statistical tests such as t-tests and analysis of variance to compare groups. We also use standard error to determine if the slope of a regression line is non-zero. We'll use the standard error of a statistic (such as the standard error of the sample mean, or the standard error of coefficients in a regression model) to determine the statistical significance (p-values).

Adding things up: Sigma ($\Sigma$) notation

The Greek symbol Sigma ($\Sigma$) in a formula means to take the sum. Let's look at calculating the mean number of flowers, using sigma notation. There were 5 plants, and we could assign each of them a label:

Plant               X1   X2   X3   X4   X5
Number of flowers   8    8    10   12   12

The letter X represents the variable, in this case the number of flowers on a plant, and the subscripts 1 through 5 indicate which plant we are considering. We use the notation $X_i$ (X sub i) to indicate any individual plant without specifying which one. So, if i=2, then we are considering plant $X_2$, with 8 flowers.

To indicate that we are adding up the number of flowers in 5 plants, we could write:

Sum of number of flowers in 5 plants = 8 + 8 + 10 + 12 + 12.

Or we could write:

Sum of number of flowers in 5 plants = $X_1 + X_2 + X_3 + X_4 + X_5$.
It would get tedious to write out this formula for a lot of plants, so instead we use the Sigma ($\Sigma$) notation:

$$\text{Sum of number of flowers in 5 plants} = \sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 8 + 8 + 10 + 12 + 12 = 50$$

Here $\sum_{i=1}^{5} X_i$ is read as "the sum of $X_i$ for i from 1 to 5". Sometimes we won't write out the subscript "i=1" or the superscript "5" if the meaning is clear. In that case, we might just write $\sum X_i$.

Finally, to calculate the mean of the number of flowers in 5 plants using sigma notation, we write the following.

$$\text{Mean of number of flowers in 5 plants} = \bar{X} = \frac{\sum_{i=1}^{5} X_i}{5} = \frac{50}{5} = 10 \text{ flowers}$$
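For readers who find code easier to follow than symbols, sigma notation corresponds to a simple loop. The sketch below writes the sum out step by step and then uses Python's built-in `sum`; the variable names are assumptions.

```python
flowers = [8, 8, 10, 12, 12]  # X_1 ... X_5

# Sigma written out as an explicit loop over i = 1..5.
total = 0
for i in range(1, 6):
    # Python lists start at index 0, so X_i is flowers[i - 1].
    total += flowers[i - 1]

# The built-in sum() does the same job in one step.
assert total == sum(flowers)

# Mean: the sum divided by the number of observations.
mean = sum(flowers) / len(flowers)
print(total, mean)  # 50 10.0
```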