Statistics – A Survival Manual

Part 1 - Measures Of Central Tendency

"What's the price of bacon?" "How is the weather in Southern California this time of year?" "How tall does the Douglas fir grow?" These are examples of questions that are really looking for a figure that would be representative of the total population and give a fairly good general impression: a measure of central tendency. Some form of "average" is generally used to provide the answer. The mean and median are the most common measures of central tendency.

Mean

The mean, occasionally called the arithmetic mean, is the most common measure of central tendency. This measure allows the value (magnitude) of each score in a sample to have an influence on the final measure. To obtain a mean, all of the observations (X) in a sample are added together, then that sum is divided by the total number (n) of observations in the sample.

    Mean = ΣX / n

Median

The median is a second measure of central tendency. Unless you have a particularly large group of numbers, this is a rather simple measure. To determine the median, sort your data so the observations either increase or decrease in value. The median is then simply the middle observation in your data set.

If the number of observations in your data set is odd, the middle observation can be found by dividing the total number of observations by two, then rounding the resulting number up to the nearest whole number. For example, if you have 25 observations in your data set, the median would be obtained by dividing 25 by 2, giving 12.5. Rounding up to the nearest whole number gives 13, making observation 13 in your ordered data set the median value for your sample.

If the number of observations in the data set is even, the median is obtained by averaging the values for the middle two observations after the data set has been ordered. For example, in a data set with 24 values, the middle two observations would be observations 12 and 13. If the value for observation 12 was "100" and the value for observation 13 was "102", the median for the data set would be "101".

Part 2 - Measures Of Variability Or Dispersion

Means and medians are expressions of central tendency. They provide a general picture of what a group of numbers is like; however, measures of central tendency can be misleading. For example, let us say the average height of two basketball teams is 180 cm. On team A, all of the players are exactly 180 cm, but on team B, one is 180 cm, two are 160 cm and two are 200 cm. Knowledge of this variation, or dispersion from the mean, would be meaningful to the coach of team B. A number of measures of dispersion are in common use, the most common being the range and the interquartile range (used in conjunction with the median) and the variance, standard deviation and coefficient of variation (used in conjunction with the mean).

The Range

When the nature of the distribution is not known, or when the distribution is known to be other than normal (e.g., skewed), the range will give a rough idea of the dispersion. The range is the difference between the high and low scores. The range does not tell us anything about the nature of the distribution, only its extent. If our sample observations were 18, 17, 17, 12, 11, 11, the range would be 7 (18 - 11 = 7).
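To make these recipes concrete, here is a minimal Python sketch (the variable names are my own, for illustration) that computes the mean, median, and range of the sample observations above, following the rules just described:

    # Sample observations from the range example above
    obs = [18, 17, 17, 12, 11, 11]

    # Mean: add all observations, then divide by the number of observations
    mean = sum(obs) / len(obs)                    # 86 / 6, about 14.33

    # Median: order the data, then take the middle observation
    ordered = sorted(obs)                         # [11, 11, 12, 17, 17, 18]
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: divide n by 2 and round up to find the middle position
        median = ordered[(n + 1) // 2 - 1]        # -1 because lists index from 0
    else:
        # Even n: average the two middle observations
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # (12 + 17) / 2 = 14.5

    # Range: difference between the high and low scores
    data_range = max(obs) - min(obs)              # 18 - 11 = 7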
The Interquartile Range

Quartiles, like the median, represent points in an ordered set of observations. The first quartile (Q1) is the point below which 25% of the observations fall; 50% of the cases fall below Q2 and 75% below Q3. The interquartile range includes those 50% of the scores that fall between Q1 and Q3. When the distribution of numbers is not normal because of extremes on either or both ends, an inspection of the middle 50% may prove to be most revealing in describing a group of numbers. Inspect the following set of ordered data. In this data set, the first quartile includes observations 1 through 5, the second quartile includes observations 6 through 10, the third quartile includes observations 11 through 15, and the interquartile range extends from 17 to 61.

    Obs. #   Value       Obs. #   Value
       1       10          11       35
       2       10          12       39
       3       11          13       42
       4       13          14       50
       5       15          15       61
       6       17          16       73
       7       21          17       95
       8       23          18      102
       9       26          19      113
      10       30          20      140

The Variance

Another group of measures of dispersion is based on the distance of each measurement from the center of the distribution (i.e., the mean or median). The first of these is the variance. To obtain the variance of a set of numbers, follow these steps (worked through in the sketch at the end of this part):

1. Subtract the measure of central tendency (usually the mean) from each observation to obtain the "deviation" of each observation.
2. Square each of these deviations.
3. Add together all of your squared deviations.
4. Subtract 1 from the total number of observations in your sample (n - 1).
5. Divide your "sum of squared deviations" by this number.

Bingo! You've got the variance of your data set. Just as a side note, dividing by "n - 1" gives what is technically known as the sample variance.

The Standard Deviation

With a normal distribution, the mean is the most accurate and descriptive measure of central tendency because it considers the magnitude (or value) of each score. In like manner, the standard deviation considers the magnitude of each score and therefore is the preferred measure of dispersion if the distribution is normal. Although the variance is a useful starting point in describing the variability of a normal data set, the standard deviation is more commonly used (for reasons we will not get into now). Once you have the variance of a set of scores, getting the standard deviation is simple. All you have to do is take the square root of the variance.

Coefficient of Variation

The final measure of dispersion is the coefficient of variation (CV for short). The CV is a useful measure of dispersion because it allows us to compare the magnitude of variation between two sets of data when the means (and the ranges) differ significantly. To obtain the CV, simply divide the standard deviation by the mean, and multiply the result by 100.
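Here is a minimal Python sketch, again using the observations from the range example, that works through the five variance steps and then derives the standard deviation and CV from the result (names are mine):

    import math

    obs = [18, 17, 17, 12, 11, 11]
    mean = sum(obs) / len(obs)                 # about 14.33

    deviations = [x - mean for x in obs]       # step 1
    squared = [d * d for d in deviations]      # step 2
    sum_sq = sum(squared)                      # step 3: about 55.33
    variance = sum_sq / (len(obs) - 1)         # steps 4 and 5: about 11.07

    sd = math.sqrt(variance)                   # standard deviation, about 3.33
    cv = (sd / mean) * 100                     # coefficient of variation, about 23.2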
Part 3 - A Test Of Location

Perhaps one of the most basic questions we ask in science is "Do these two [fill in the blank with your favorite research topic] differ?" In most cases, we begin answering that question by using a statistical test that looks for differences in the location of the measure of central tendency for the two "things". The most commonly used test for differences in central tendency between two samples is the t-test.

The t-test should only be applied to independent samples. This means the selection of a member of one group has no influence on the selection of any member of the second group. If you do not understand what this means, you need to ask me about it, because this concept of independence between or among samples is a very basic assumption in many of the statistics commonly used in the biological sciences.

The t-test incorporates not only a measure of central tendency (the mean) for the samples, it also incorporates measures of dispersion (the standard deviation) and the total number of observations in each sample (the sample size). Our ability to detect a true difference between two samples depends not just on the absolute difference between the means, but on the amount of dispersion and the size of our sample. In general, for the same difference between means, as variability decreases, our ability to detect a statistically significant difference increases. Likewise, as sample size increases, our ability to detect a difference increases.

Once our arithmetic manipulations are completed, the result is compared to a table of values (often called a "values of t" table, see the attached sheet). The table has several columns of values or levels of significance. The researcher must decide how much flexibility can be allowed; this will depend in part on the degree of accuracy of the measuring instrument. Most scientific research sets the significance level at .05. This means there is only a 5 percent chance of declaring the means of the two groups different when in fact they are not.

The t-test

The formula for the t-test was developed by W. S. Gosset in the early 1900s. As a brewery employee he developed a system of sampling the brew as a periodic check on quality. He published his work under the pen name "Student", which explains why you hear this formula called the Student's t from time to time.

In order to proceed with a t-test, the "standard error of the difference" between the means of the two groups must be determined. Just as the name implies, this is a measure of variability in our estimate of the difference between the two means. To illustrate the t-test, look at the following table. The numbers represent average fish lengths (in mm) from an urban and a rural stream.

                          Urban Stream    Rural Stream
    Mean Length                80              84
    Standard Deviation          9.16            7.83
    Sample Size                30              30

At first glance this would seem to suggest fish in the rural stream are larger. But we must ask the question: is the 4 mm difference a chance difference, or does it really represent a true difference? If we were to take another sample from each stream and do it again, would we get the same results?

To answer this question, you must first compute the standard error of the difference (sdiff). The formula at the end of this part is appropriate for independent samples, if the distributions are normal. Once you compute the standard error of the difference, you can determine if a true (statistically significant) difference exists with the aid of the t-test and tables or an appropriate computer program. If we plug in our numbers from the data table, we find that the standard error of the difference for our 4 mm length difference is 2.29 mm. The next step in the process is to use our value for sdiff to calculate a t-score; that formula is also given at the end of this part.

As mentioned above, in order to determine if there is a true difference between the means, you need a table of values of t. You also need to know the degrees of freedom used in the problem. Degrees of freedom are determined by adding the number of scores in each group and subtracting 2, [n1 + n2 - 2]. Looking at the attached sheet, the left hand column is labeled df (for degrees of freedom). Look down that column until you find the appropriate df for your problem and read across to the column labeled .05 for two-tailed tests. If the t value you calculated is larger than the tabled .05 figure, there is a true difference between the means.

For the example above, the degrees of freedom would be 58 (n1 + n2 - 2), but since the table has no row 58, we drop to row 50, the more conservative alternative. The two-tailed values in this row are .05 = 2.009, .01 = 2.678 and .001 = 3.496. Our t value was 1.746, which is smaller than 2.009, so we can say there is not a significant difference between the means at the .05 (or 5%) level. It should be noted that this only tells us if there is a significant difference, not the direction of any difference; testing the latter is beyond the intent of this survival manual.

Formula to Calculate the Standard Error of the Difference

    sdiff = sqrt( (s1²/n1) + (s2²/n2) )

where s1 and s2 are the standard deviations of the two samples and n1 and n2 are their sample sizes.

Formula to Calculate t-scores

    t = (mean1 - mean2) / sdiff
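If you would rather let a computer do the arithmetic, the sketch below (function and variable names are mine) implements the two formulas above. Note that this standard formula gives sdiff of about 2.20 and t of about -1.82 for the fish data, slightly different from the rounded worked values quoted earlier; the conclusion at the .05 level is the same either way.

    import math

    def t_independent(mean1, sd1, n1, mean2, sd2, n2):
        # Standard error of the difference for two independent samples
        s_diff = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
        t = (mean1 - mean2) / s_diff
        df = n1 + n2 - 2
        return t, s_diff, df

    # Fish lengths: urban stream vs. rural stream
    t, s_diff, df = t_independent(80, 9.16, 30, 84, 7.83, 30)
    # s_diff is about 2.20, t about -1.82, df = 58.
    # |t| < 2.009 (the two-tailed .05 value for the 50 df row),
    # so we cannot declare a significant difference at the .05 level.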
Part 4 - Correlation

Even though we often think of statistical techniques as a way of telling if two things, groups or sets of data are different, we can also use statistical analysis to ask if two things are related. For example, we might want to know whether the rate of growth in a bacterial colony is related to temperature. Alternatively, we might be interested in the relationship between the density of two types of plants that use similar types of resources. These and many other questions can be tested using a statistical process called correlation.

There are several different formulas to determine correlation. The most frequently used is the Pearson product moment correlation, named after Karl Pearson. This procedure requires that the numbers be at least interval in nature. It also requires that the data be paired. We can use the bacterial colony question as an example of what is meant by paired data. Each item of the pair must have a common denominator. In this instance, the number of colonies counted and the temperature both come from the same petri dish. The temperature in a given dish is one of the paired numbers and the number of bacterial colonies is the other member of the pair. The pair of scores for dish #1 (see the next table) is 100 (number of colonies) and 42 (temperature).

    Dish Number   Number of Colonies   Temperature
         1               100                42
         2                70                34
         3                85                34
         4                75                32
         5                65                30
         6                60                29

From the data above it can easily be seen that the warmer the dish, the greater the number of colonies. Knowing this may be adequate for your purposes, or you may wish to have a more precise measure of the degree of relationship. You may even be interested in graphically showing this phenomenon.

Scattergram

The graphical procedure to show correlation is called a scattergram. To develop a scattergram you first place one set of the numbers in a column in descending order and label it X. In our example of temperature and bacterial growth, we can label the number of colonies as the X column. The column of the paired figures (temperature) is listed next to the X column and labeled the Y column. Do not break up the pairs. Except in the instance of a perfect positive correlation, the Y column will not be in perfect descending order.

On graph paper, a horizontal scale is marked off for the X column near the bottom of the page. The values increase from left to right. A vertical scale is drawn to intersect one line to the left of the lowest X value. The vertical scale represents the temperature on the Y column. The horizontal scale is the X axis and the vertical scale the Y axis. The length of the scale on the X axis should be about equal to the length of the scale on the Y axis. There is no rule that requires this; however, the procedure helps space the scattergram, which makes it easier to interpret. Each pair of figures is plotted on the graph. A dot is placed where the X and Y values intersect for each pair of figures. The dots are not connected.

Graph the data presented above on a sheet of graph paper. If you have done this correctly, the dots should form nearly a straight line from lower left to upper right. This configuration would be considered a positive correlation, and the more nearly the dots form a straight line, the stronger the relationship between the two variables. If the dots had formed a pattern from upper left to lower right, the correlation would have been regarded as negative; that is to say, there was indeed a relationship, but the conclusion would be that the variables plotted had an inverse effect on each other. If we had continued to increase the temperature above 50, we might have seen a decline in the number of colonies as we exceeded the temperature tolerance of the bacterial species we were testing. If we had only graphed the response to temperatures above 50, we might have observed a negative relationship. If the dots appeared randomly over the scattergram, you would interpret this as no relationship between the X and Y column measurements.
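If you would rather plot by machine than on graph paper, a short sketch using Python's matplotlib plotting library (an assumption on my part; any plotting package would do) draws the same scattergram:

    import matplotlib.pyplot as plt

    # X column in descending order, with the Y column kept paired
    colonies = [100, 85, 75, 70, 65, 60]
    temps    = [ 42, 34, 32, 34, 30, 29]

    plt.scatter(colonies, temps)          # dots only; do not connect them
    plt.xlabel("Number of colonies (X)")
    plt.ylabel("Temperature (Y)")
    plt.title("Scattergram of colony counts and temperature")
    plt.show()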
Computation

As mentioned above, there are several formulas for determining correlation. At this point we shall only be concerned with the raw score method. The formula is:

    r = [ n(ΣXY) - (ΣX)(ΣY) ] / sqrt( [n(ΣX²) - (ΣX)²] [n(ΣY²) - (ΣY)²] )

The correlation coefficient always falls between +1.0 and -1.0; it cannot be anything else. The closer to 1.0, the more perfect the relationship between the two sets of numbers in a positive way; that is, when one figure goes up, the corresponding paired number also goes up, and in like manner, when one of the paired numbers goes down, so will the other. A negative correlation supports the opposite conclusion: when the first number goes up, its pair goes down, or vice versa. The closer to zero, the less the likelihood of any relationship.
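As a check on your hand calculation, here is a minimal Python sketch of the raw score method applied to the petri dish data (the function name is mine):

    import math

    def pearson_r(xs, ys):
        # Raw score method for the Pearson product moment correlation
        n = len(xs)
        sum_x, sum_y = sum(xs), sum(ys)
        sum_xy = sum(x * y for x, y in zip(xs, ys))
        sum_x2 = sum(x * x for x in xs)
        sum_y2 = sum(y * y for y in ys)
        numerator = n * sum_xy - sum_x * sum_y
        denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
        return numerator / denominator

    colonies = [100, 70, 85, 75, 65, 60]
    temps = [42, 34, 34, 32, 30, 29]
    r = pearson_r(colonies, temps)   # about 0.94: a strong positive relationship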
Part 5 – Contingency Table Analysis

Introduction

The types of statistics you have seen so far allow you to describe measures of central tendency and dispersion (means, medians, ranges, standard deviations, etc.), test for differences among populations (the t-test), or look for relationships between two measures (correlation). One additional group of statistical techniques that is often useful in ecological studies is tests that allow us to compare the frequency of observations in various categories between two different sets of observations. Our comparisons may be between a set of real world observations and some expected distribution of observations, or we may want to compare frequencies of counts between two sets of real world observations drawn from different areas or populations.

You have already encountered the first type of comparison in your introductory biology courses when you compared observed phenotypes in a population to those that would be expected if the population were in Hardy-Weinberg equilibrium. This general class of tests is known as a "goodness-of-fit" test, since we are attempting to determine if our observations "fit" our expectations. The most common goodness-of-fit test is the chi-square goodness-of-fit test.

The second type of comparison is more common in ecological field studies. In this case, we usually have two variables that we have used to classify our observations. Each variable has two or more categories. Based on these categories we can set up a table with n columns (where n = the number of categories for the first variable) and p rows (where p = the number of categories for the second variable). Each observation can then be assigned to one of the cells in our table. The question we hope to answer is whether the value of the column variable has any influence on the values of the row variable. Statistically, we are asking the following question: "Is the value of our second variable independent of the value of our first variable?" This type of test is known as a chi-square contingency table analysis. The null hypothesis for contingency table tests is "The value of our second variable is independent of the value of our first variable." If this seems a little fuzzy, hang on while we run through an example. If it still seems fuzzy after that, come and talk to me about it!

Frequency Distributions

Before we get too far along on the contingency table test, we should take a moment to understand the idea of frequency distributions. Let's start with a simple idea that everybody can grasp: gender. Assume we have a class that has 50 students in it. We can classify each person in the class as male or female. If the class is split evenly according to gender, then the frequency of males in the class is 25 and the frequency of females is 25. Gender is easy because it is what is commonly called a discrete or "nominal scale" variable. There is little doubt how many categories we have.

But what if we wanted to categorize the class not by gender, but by height? Height is a "continuous" variable, meaning that if we had a measuring device that was accurate enough, there are an infinite number of categories into which we could place the members of our class. What we normally do with continuous data when we want to determine frequencies is establish intervals of equal "width" or magnitude, and assign our observations to those categories. In fact, we do this as a matter of course every day. For example, we might hear a person claim that they are 175 centimeters tall (or we would if we used metric units like the rest of the world), but if we were to measure their true height, we would find that they were actually 174.67 cm. The "width" of our categories depends both on the number of observations we have and the range of our data.

Contingency Table Analysis

To illustrate the idea of contingency table analysis, let's continue with the same example. We can ask the question: does the gender of an individual have an effect on their height? As a null hypothesis, we could state this question as "The gender of an individual has no effect on their height". Assume that we measure the height and record the gender of all the members of our hypothetical 50 member class. The raw data are given in the table below.

    Height  Gender    Height  Gender    Height  Gender    Height  Gender    Height  Gender
      150     F         158     F         165     M         173     F         180     M
      151     M         158     F         166     M         173     M         181     M
      152     M         159     F         167     F         174     F         182     M
      152     M         160     M         167     M         175     M         182     M
      153     F         161     M         168     F         176     M         183     M
      154     F         161     M         169     F         176     F         184     F
      155     F         162     F         170     F         177     F         185     F
      155     M         163     F         170     F         178     M         185     F
      156     M         164     F         171     F         179     M         186     M
      157     M         164     F         172     M         179     F         187     M
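Here is a minimal Python sketch (names are mine) that assigns each raw observation to a 10 cm interval and tallies the frequencies by gender; it reproduces the observed frequency table that follows:

    from collections import Counter

    # (height_cm, gender) pairs transcribed from the table above
    data = [
        (150,"F"),(151,"M"),(152,"M"),(152,"M"),(153,"F"),(154,"F"),(155,"F"),(155,"M"),(156,"M"),(157,"M"),
        (158,"F"),(158,"F"),(159,"F"),(160,"M"),(161,"M"),(161,"M"),(162,"F"),(163,"F"),(164,"F"),(164,"F"),
        (165,"M"),(166,"M"),(167,"F"),(167,"M"),(168,"F"),(169,"F"),(170,"F"),(170,"F"),(171,"F"),(172,"M"),
        (173,"F"),(173,"M"),(174,"F"),(175,"M"),(176,"M"),(176,"F"),(177,"F"),(178,"M"),(179,"M"),(179,"F"),
        (180,"M"),(181,"M"),(182,"M"),(182,"M"),(183,"M"),(184,"F"),(185,"F"),(185,"F"),(186,"M"),(187,"M"),
    ]

    def interval(height):
        # Assign a continuous height to one of four 10 cm classes
        if height <= 160:
            return "<= 160"
        elif height <= 170:
            return "161 - 170"
        elif height <= 180:
            return "171 - 180"
        else:
            return "> 180"

    counts = Counter((interval(h), g) for h, g in data)
    for row in ["<= 160", "161 - 170", "171 - 180", "> 180"]:
        print(row, counts[(row, "M")], counts[(row, "F")])
    # Should print the observed frequencies: 7 7, 5 9, 7 6, 6 3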
When setting up a contingency table, we want to have at least 5 observations in 80% of our cells (one of those statistical rules that you should just take as a given), so when we are deciding about the "width" of our intervals, we should keep this in mind. For the data above, an interval width of 10 cm works pretty well. That gives us four classes: individuals less than or equal to 160 cm, individuals from 161-170 cm, from 171-180 cm, and over 180 cm. We can then assign all of our observations to one of the cells of a 2 × 4 contingency table, where the columns represent gender and the rows represent the height interval.

    Height Interval (cm)   Male   Female
    <= 160                   7       7
    161 - 170                5       9
    171 - 180                7       6
    > 180                    6       3

This table represents our observed frequencies. What do we compare it to? We need to create a second table that represents what our data would look like if gender and height were totally independent of one another. To do this, we use our observations to estimate what our cell frequencies would be if this were true. The first step in this process is to get total counts for each row, each column, and for the overall table, as illustrated below.

                   Male   Female   Row Total
    <= 160           7       7        14
    161 - 170        5       9        14
    171 - 180        7       6        13
    > 180            6       3         9
    Column Total    25      25        Grand Total 50

The next step is to use the row total, the column total, and the grand total to determine the expected value for each cell in the table. To do this, you multiply the column total by the row total for each cell, then divide that number by the grand total. The table below illustrates the process.

                   Male                Female              Row Total
    <= 160         (14*25)/50 = 7      (14*25)/50 = 7        14
    161 - 170      (14*25)/50 = 7      (14*25)/50 = 7        14
    171 - 180      (13*25)/50 = 6.5    (13*25)/50 = 6.5      13
    > 180          (9*25)/50 = 4.5     (9*25)/50 = 4.5        9
    Column Total   25                  25                    Grand Total 50

Finally, now that we have our expected distribution, we can compare it to our observations to determine if there is a difference. To accomplish this, we begin by subtracting our expected values from our observed values for each cell in our table. We square that number, then divide by our expected value.

    Step 1         Male            Female
    <= 160         7-7 = 0         7-7 = 0
    161 - 170      5-7 = -2        9-7 = 2
    171 - 180      7-6.5 = 0.5     6-6.5 = -0.5
    > 180          6-4.5 = 1.5     3-4.5 = -1.5

    Step 2         Male                   Female
    <= 160         (0)²/7 = 0.00          (0)²/7 = 0.00
    161 - 170      (-2)²/7 = 0.57         (2)²/7 = 0.57
    171 - 180      (0.5)²/6.5 = 0.04      (-0.5)²/6.5 = 0.04
    > 180          (1.5)²/4.5 = 0.50      (-1.5)²/4.5 = 0.50

Once we have the values from step 2, we sum them to obtain our chi-square value. In this case, our calculated value is 2.22. We then take this value to our handy chi-square table (attached to this handout) to determine if we should reject or fail to reject our null hypothesis. To get your p-value, you need to know (as always) the degrees of freedom you have. In contingency tables, your degrees of freedom are equal to (number of rows - 1) × (number of columns - 1) = (4-1) × (2-1) = 3. Looking at our table, we find that with 3 degrees of freedom, the chi-square value at the 0.05 level of significance is 7.815. To reject our null hypothesis, our calculated chi-square value must be greater than 7.815. Since our calculated value is less than this, we cannot reject our null hypothesis. Based on these data, the height of an individual does not depend on his or her gender. This does not surprise me, since I used a random number generator to assign heights to males and females!
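To automate the whole procedure, here is a minimal Python sketch (the function name is mine) that builds the expected values from the row, column, and grand totals and sums the chi-square contributions; you can also reuse it to check the practice problem below:

    def chi_square(observed):
        # observed: a list of rows, each row a list of cell counts
        row_totals = [sum(row) for row in observed]
        col_totals = [sum(col) for col in zip(*observed)]
        grand_total = sum(row_totals)
        chi2 = 0.0
        for i, row in enumerate(observed):
            for j, obs in enumerate(row):
                # Expected value: (row total * column total) / grand total
                expected = row_totals[i] * col_totals[j] / grand_total
                chi2 += (obs - expected) ** 2 / expected   # steps 1 and 2, summed
        df = (len(observed) - 1) * (len(observed[0]) - 1)
        return chi2, df

    # Height intervals (rows) by gender (columns: male, female)
    heights = [[7, 7], [5, 9], [7, 6], [6, 3]]
    chi2, df = chi_square(heights)
    # chi2 is about 2.22 with df = 3; since 2.22 < 7.815, we cannot
    # reject the null hypothesis at the 0.05 level of significance.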
Practice, practice, practice!

To be sure that you can do a contingency table analysis, play with the data set below on your own time. The data represent the color patterns of a species of tiger beetle during different times of the year. Your null hypothesis is that season does not affect color. If you carry out the test correctly, you should obtain a chi-square value near (given rounding errors) 27.7.

                     Color
    Season           Bright Red   Pale Red
    Early Spring         29          11
    Late Spring         273         191
    Early Summer          8          31
    Late Summer          64          64

The information given in this survival manual is drawn from the following three sources:

Ott, L., R. F. Larson, and W. Mendenhall. 1983. Statistics: A Tool for the Social Sciences.
Roys, K. B. 1989. Research and Statistics: An Introduction.
Sokal, R. R., and F. J. Rohlf. 1981. Biometry.

Additional references that are useful for sampling and statistical information are listed below.

Cochran, W. G. 1977. Sampling Techniques.
Green, R. H. 1979. Sampling Design and Statistical Methods for Environmental Biologists.
Krebs, C. J. 1989. Ecological Methodology.
Southwood, T. R. E. 1979. Ecological Methods.