Statistics for AP Biology Populations of organisms vary in almost every characteristic (e.g., size, weight, strength, intelligence, growth rate, response to stimulus, longevity, etc.) Statistics allow scientists to summarize characteristics of large populations quantitatively by making measurements on a small portion of the individuals (i.e., a sample) of that population. Using statistics we can also determine whether observed differences between characteristics of two populations are “statistically different.” In order to make inferences about entire populations, samples must be taken at random (without bias). The size of the samples and the amount of variability in the measurements taken will affect our ability to describe or to make predictions about the populations. Descriptive Statistics A parameter is a measure that describes or characterizes an entire population of data. It is typically not possible to collect all data in a population, and therefore, we must estimate parameters by computing statistics. Thus, a statistic describes or characterizes a sample of the population. We begin describing characteristics of populations by using measures of central tendency. Three Measures of Central Tendency Mean: The most commonly used parameter describing the central tendency of a population is the mean (sometimes called the average). The mean of the entire population is represented by the Greek symbol (pronounced mew). The sample mean that estimates the population mean is represented by X (pronounced x-bar). The sample mean is calculated as follows: X = ΣX/n (Equation 5.1) where X represents an individual datum (e.g., tree height, weight, etc.) and n is the number of individuals in the sample. As long as the sample is a random sample of the population, we may assert that the sample mean is a good estimate of the population mean. Just how precise will be shown later. Median: Another parameter describing the central tendency of a population is the sample median. The median is the middle measurement in a ranked listing of the data (i.e., there are an equal number of individuals with values greater than the median and of individuals with values less than the median). If there is an even number of individuals in the sample, the median is the mean of the middle two measurements. Eleven is the median of Sample 1 and 11.5 is the median of Sample 2 below. Sample 1 Sample 2 5,5,6,7,7,7,8,8,10,11,12,13,15,15,15,16,16,17,18 5,5,6,7,7,7,8,8,10,11,12,13,15,15,15,16,16,17,18,18 1 Variability: Calculating the mean or central tendency of the set of data only provides a partial description of the data. For example, look at the following two data sets. Each of the population samples have the same mean, however, the data of Sample 1 are much more variable than the data of Sample 2. Sample 1 9 22 16 13 11 19 Mean = 15 Sample 2 15 16 17 15 13 14 Mean = 15 Another way to visualize variability is using histograms. Data from samples 1 and 2 are plotted below using histograms. Note the dispersion of measurements around the mean (15). 3 3 Sample 1 Sample 2 2 Frequency Frequency 2 1 1 0 0 8 9 10 11 12 13 14 15 16 17 18 19 20 Measurement 21 22 23 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Measurement Range: One measure of data variability is the range or the difference between the largest and smallest measurement (e.g., range = {9,22} for Sample 1 and {13,17} for Sample 2). However, sample range almost always underestimates actual population range and does an incomplete job of describing variability. Sum of Squares: More applicable in statistical analysis is the measure of data variability based on the deviation from the sample mean. We define this quantity as the sum of squared deviations from the mean or sum of squares (abbreviated SS). The sum of squares is calculated as follows: SS = Σ(X – X )2 (Equation 5.2) thus, for the data in Sample 1, 2 SS = (9 – 15)2 + (22 – 15)2 + (16 – 15)2 + (13 – 15)2 + (11 – 15)2 + (19 – 15)2 = 122 Sample Variance: Sum of squares can be used to calculate a better measure of dispersion or variability called the sample variance (abbreviated s2): s2 = SS/DF (Equation 5.3) where DF is the degrees of freedom, defined as DF = n – 1. Therefore, for the data from Sample 1, DF = 6 – 1 = 5. Thus, s2 = 122/5 = 24.4. The sample variance is a good estimate of the population variance (σ2). In other words, s is a good estimate of the variance we would compute if we were able to sample the entire population of interest. 2 Standard Deviation: Rather than the sample variance, we usually report the standard deviation (s or SD), (Equation 5.4) s or SD = s 2 or For Sample 1, SD = 24.4 4.94 = The sample mean and standard deviation are often used to describe and compare characteristics of populations. We report this information as the sample mean ± the standard deviation ( X ± SD). For Sample 1, we would report: X = 15 ± 4.94. The standard deviation is a measure of the dispersion, or scatter, of the data. For instance, if a surgeon collects data for 20 patients with soft tissue sarcoma and the average tumor size in the sample is 7.4 cm, the average does not provide a good idea of the individual sizes in the sample. It could be that the sizes in the sample are similar and lie between 7 and 9 cm or that the sizes are dissimilar with some tumors being very small and others very large. In the former case, size likely will play little role in the differences in outcome between patients, whereas in the latter case tumor size could be an important factor (confounding variable) explaining differences in outcome between patients or relating to other variables such as surgical margins. Further, having an estimate of the scatter of the data is useful when comparing different studies, as even with similar averages, samples may differ greatly. It therefore is important to report the variability in the sample and this is done with the standard deviation of the sample. 3 Statistics Practice 1-Using the data from Table 5.1, make a table that lists the population name (cave) and the sample statistics for each population including mean number of parasites, median, DF, SS, s2, and SD for each of the populations. Next, try your hand at creating a histogram. Table 5.1 (Sample Data for practice). The number of parasites on randomly selected bats located in 4 caves in the Southern United States during the summer of 2013. Coronado cave Manitou cave Onyx cave Crystal cave 3 6 3 11 2 5 5 10 6 11 3 7 7 3 2 6 5 9 1 7 2 8 1 5 9 6 4 2 1 28 7 1 4 Estimating Population Means When we estimate a population mean from a sample mean, we may wonder how close the estimate of the population mean is to the actual population mean. This is answered by considering the fact that repeated samples from the same population will likely have somewhat different means. The variability of these possible sample means is known as the standard error (SE) of the mean and is calculated as follows: SE = s / n (Equation 6.1) Using this equation, the standard error for Sample 1 is SE = 4.94 / 6 = 2.02 It is important to differentiate between SD and SE. The SD is an estimate of the variance in the actual characteristic measurements. In contrast, the SE is a measure of the variance among estimates of the sample means that can be calculated from a single sample. 4 We can use the standard error to calculate a confidence interval (CI), a range of values that, with a stated level of confidence, will include that actual population mean, µ. There are many statistics that can be used to calculate a confidence interval. One of the simplest and most common statistical tests uses a statistical distribution known as the Student’s t (see the attached table 6.2). In this table, DF is the degrees of freedom (n – 1) and α is the significance level. We can calculate a confidence interval using the following equation. (1 - α ) CI for µ = X ± t(SE) (Equation 6.2) where t is the value obtained from Table 6.2 and SE is the standard error. A significance level of 5% is the most frequently used level in biological sciences. Using α = 0.05 allows us to compute a 95% (1 - α) confidence interval for Sample 1 as follows: t(DF=5, α=0.05) 95% CI = 2.57 = 15 ± 2.57(2.02) = 15 ± 5.19. Interpreting confidence intervals A 95% confidence interval may be interpreted in the following manner for Sample 1: 95% of the time (in 95 experiments out of 100), µ, the actual (unknowable) mean of the entire population, will fall between 9.81 (15 – 5.19) and 20.19 (15 + 5.19). For this exercise, we will assume that when comparing population means, if confidence intervals do NOT overlap, the means are significantly different. See Figure 6.1 on p. 4 for a good way to display sample means and their confidence intervals. This demonstration shows us two important relationships between the measurements we make and the size of the sample we choose to measure. Measurements with lower variability (smaller SD) provide us with a narrower confidence interval within which µ is expected. Larger sample sizes (n) also lead to narrower confidence intervals because the denominator in our calculation of SE will be larger. Note that you can’t control the variability of measurements because that is inherent in whatever you measure, however, you can control your sample size and it is almost always important to make it as large as possible. 5 More statistics practice Using the data from Table 6.1 (below), make a table that lists the population name (Lake) and the sample statistics for each population including mean ducklings fledged, median, DF, SS, s2, SD, SE, and t95(SE) for each of the populations. In addition, make a bar graph that illustrates the mean with the 95%CI represented as “error bars.” Are there any significant differences between the lakes? Table 6.1 (Sample Data for homework). The number of ducklings that fledged (survived) from randomly located nests at 5 prairie pothole lakes in southern Minnesota during the summer of 2009. Swan Lake 4 5 7 7 2 3 4 11 Eagle Lake 6 7 7 6 8 12 3 11 9 Minn. Lake 2 1 0 4 1 0 1 2 3 Loon Lake 11 13 11 17 12 10 14 Middle Lake 3 3 5 4 4 6 3 5 4 6