IB Biology Topic 1: Statistical Analysis

Keywords: arithmetic mean, causal relationship, correlation, error bar, significance, spread, standard deviation, t-test, value, variability, variable

1.1.1 State that error bars are a graphical representation of the variability of data.
1.1.2 Calculate the mean and standard deviation (SD) of a set of values.
1.1.3 State that the term standard deviation is used to summarize the spread of values around the mean, and that 68% of the values fall within one standard deviation of the mean.
1.1.4 Explain how the standard deviation is useful for comparing the means and the spread of data between two or more samples.

Describing variation mathematically:

Living things vary, so that even two peas in a pod show a variety of sizes and shapes. This raises a number of questions. How can we describe the range of variation? Which pea size is the most common? Can we sort the peas into groups to decide if they came from the same or different pods? Biologists ask these types of questions not only about living organisms but also about sets of data from experiments.

The Arithmetic Mean:

The shoe sizes of a group of ten students were recorded. The results are listed here:

Group A: 5 6 8 7 8 6 7 7 9 7

The arithmetic mean is the total divided by the number of results, so:

Total = 70
Number of results = 10
Mean = 70 / 10 = 7

If you were a shoe manufacturer you would find this useful, as you would know that it would be a good idea to make plenty of shoes at size 7. However, the mean alone does not tell you how wide the variation is around it. All of the distributions below have means of 7 or very close to it, but they clearly need very different outputs from the shoe factory.

Group A: 5 6 8 7 8 6 7 7 9 7
Group B: 7 7 7 7 7 7 7 7 7 7
Group C: 5 6 6 6 7 7 7 7 8 9
Group D: 5 5 5 5 8 8 8 8 9 9

Most people, if asked to summarise each set of data above, would probably come up with the idea of using the mean.
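The Group A calculation can also be checked with a few lines of Python. This is only a minimal sketch: the function name arithmetic_mean is chosen here for illustration and is not part of the worksheet.

```python
# Shoe sizes for Group A, copied from the worksheet.
group_a = [5, 6, 8, 7, 8, 6, 7, 7, 9, 7]


def arithmetic_mean(values):
    """Return the total of the values divided by the number of values."""
    return sum(values) / len(values)


total = sum(group_a)             # add up all ten results
mean = arithmetic_mean(group_a)  # total divided by number of results

print("Total =", total)
print("Number of results =", len(group_a))
print("Mean =", mean)
```

The same function can be reused on Groups B, C and D to compare their means.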
If asked what other information would be useful, can you think of anything? You may have suggested that a measure of the ‘spread’ of the data would be useful. A very simple way to do this is to record the range. Can you complete the information below to describe each set of data?

Group   Mean   ‘Spread’ of data
A       7      Data ranges from 5 to 9
B       7
C       7
D       7

Can you see any problems in only using this to describe a group of data?

Standard Deviation

The standard deviation of a set of data is calculated from the deviation of each measurement from the mean.

Which group is very tightly packed around 7 (a shoe manufacturer’s dream)? _____________

This group should have the smallest standard deviation. If data is clustered around the mean you would expect to have lots of small deviations away from the mean. If the data is more spread out you would expect the deviations to be bigger.

Continuous data shows a smooth transition of values across a spectrum. Weight and height are good examples (numbers of plants in a particular area, by contrast, are counts and so are discrete data). To describe the spread of results in continuous data, biologists use a statistic called the standard deviation.

Adapted with permission from Sha Tin College

Calculating Standard Deviation (What are the steps involved and what does it mean?)

The heights of the students whose shoe sizes were checked were also recorded. The data shows a typical distribution and the mean can easily be calculated.

Heights / cm: 157 160 161 164 171 172 175 176 177 182

Work out the total and the mean for this set of data:

The individual members of the group are different from the mean. These differences can be calculated.

Height / cm            157 160 161 164 171 172 175 176 177 182
Difference from mean

Since some of the individuals fall below the mean, some of the differences will be negative. To convert all these values into positive numbers they are squared.
Height / cm            157 160 161 164 171 172 175 176 177 182
Squared differences

The figures above give a measure of the deviation of the individuals from the mean. The standard deviation is based on the mean of these squared deviations, so you will need to find the mean of these values:

Total = 642.5 (explain what numbers were used to calculate this)
Mean = 64.25 (explain how this number was obtained)

Since this is the mean of the squares of the original deviations, we take its square root and call the result the standard deviation.

Standard deviation = √64.25 = 8.02

The standard deviation is a useful way to describe the variability in a set of continuous data. The larger the standard deviation, the larger the spread of data around the mean.

Questions

The table below shows the heights of two groups of IB Biology students.

Group A heights / cm: 180 176 160 169 172 178 182 177 175
Group B heights / cm: 180 177 163 166 175 177 180 179 173 169

a. Calculate the mean for each set of students.
b. Calculate the standard deviation for each set of students.

Group A
Height / cm   Difference from mean   Square of difference
180
176
160
169
172
178
182
177
175

Group B
Height / cm   Difference from mean   Square of difference
180
177
163
166
175
177
180
179
173
169

Total of squared differences for group A:
Total of squared differences for group B:

Standard deviation for group A = √      =
Standard deviation for group B = √      =

The graph below shows three different groups of data:

Using your Graphical Calculator to Calculate Standard Deviation

1. Turn the calculator on.
2. Press the stat key and press enter.
3. Input the data into L1 (use the heights data from the worked example above).
4. After all the data has been entered, press the stat key again, shift across to select CALC and press enter.
5. Press the 2nd key followed by L1 and enter. This will give information about your set of data.
Work out what each of the following means:

Calculator output   Meaning
x̄ = 169.5           x̄ =
Σx = 1695           Σx =
Σx² = 287945        Σx² =
σx = 8.02           σx =
n = 10              n =

How Fast Are You?

How fast are you? How do you compare to your classmates? Human reaction time can be measured using a ruler and the ‘Drop-Catch’ method with a partner.

Procedure

1. Have your partner brace his or her writing hand on the edge of a desk or table, with the fingers and thumb extending over the edge. Hold the ruler above your partner’s hand so that the “0” line is level with the top of the thumb, as shown in the figure. The ruler should be able to slide easily between your partner’s thumb and index finger.
2. Drop the ruler so that it falls straight down between your partner’s thumb and index finger. Your partner should grab the ruler as quickly as possible. Read the number on the ruler just above your partner’s thumb and index finger. This is the distance the ruler fell before your partner caught it. Record this number in your data table.
3. Repeat steps 1 and 2 four more times. You should have a total of five measurements in your data table.
4. Have your partner switch hands so that he or she is catching the ruler with the non-writing hand. Repeat steps 1 through 3. You should now have a total of 10 measurements for your partner: five for the writing hand and five for the non-writing hand. Don’t forget to record them in the data table.
5. Switch places with your partner and repeat the whole exercise.

DATA TABLE: To Show Rates in Myself and My Partner When Catching a Ruler

                     Your partner                        You
Trial                Writing hand    Non-writing hand    Writing hand    Non-writing hand
                     distance / cm   distance / cm       distance / cm   distance / cm
1
2
3
4
5
Average
Standard deviation

Questions:

Which person had the faster reaction time with the writing hand, you or your partner?

Was your average reaction time when you used your writing hand different from when you used your non-writing hand? Was your partner’s?
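The Average and Standard Deviation rows of the data table use the same steps as the heights example worked through earlier. The sketch below follows those steps in Python using the worksheet's heights data, so the result can be compared with the value of 8.02 found by hand. The function name is an assumption made for illustration. Note that, like the worked example, it divides by the number of values n; some calculators and spreadsheets also offer a version that divides by n - 1, which gives a slightly larger answer.

```python
import math

# Heights / cm from the worked example in this worksheet.
heights = [157, 160, 161, 164, 171, 172, 175, 176, 177, 182]


def standard_deviation(values):
    """Standard deviation following the worksheet's steps (divides by n)."""
    mean = sum(values) / len(values)
    # Step 1: difference of each value from the mean, squared to remove signs.
    squared_differences = [(value - mean) ** 2 for value in values]
    # Step 2: mean of the squared differences.
    mean_of_squares = sum(squared_differences) / len(values)
    # Step 3: square root of that mean.
    return math.sqrt(mean_of_squares)


sd = standard_deviation(heights)
print("Mean =", sum(heights) / len(heights))
print("Standard deviation =", round(sd, 2))
```

To use it on your own results, replace the list of heights with your five distances for one hand.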
What does the standard deviation tell us?

The standard deviation, as mentioned before, is a measure of the variability of a set of data or, to be more precise, of its spread around the mean. For data that follow a normal distribution, about 68% of all values lie within the range of the mean plus or minus one standard deviation (i.e. x̄ ± 1s). About 95% of all values lie within the range of the mean plus or minus two standard deviations (i.e. x̄ ± 2s).

From your data from ‘How Fast Are You?’: if you were to get 1000 more readings for your writing hand, theoretically...

Between what numbers would 68% of your values lie?

Between what numbers would 95% of your values lie?

Questions:

1. A fish farmer sells 10 000 trout in a year (mean mass = 400 g and s.d. = 25 g).
   a. Assuming a normal distribution, estimate the number of these that would be in the range mean ± 1 s.d.
   b. Estimate the number that would have a mass greater than the mean + 1 s.d.
2. The pulse rates of 2400 patients were recorded and it was calculated that the mean value was 74 beats per minute with a s.d. of 6 beats per minute. What percentage of the patients had a pulse rate in the range 68-80 beats per minute?

How do we show variability in our data when we graph it?

When carrying out an experiment in which you want accurate results, you usually take repeat measurements and then take the mean of those measurements. Sometimes we need to be able to compare two means. But can we be sure that our two means are really different? Perhaps our data is not precise enough?

One way to take into account the variability in the results, and hence their level of precision, is to draw error bars. A simple way to construct an error bar is to use the maximum deviation of a single data point away from the mean. When drawing a graph, an error bar is drawn above and below the mean that shows the maximum deviation away from the mean.
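As an illustration of the maximum-deviation method just described, the sketch below finds the mean of a set of readings and the top and bottom of its error bar. The worksheet's heights data is used only as a convenient stand-in for a set of repeat measurements, and the function name is an assumption made for this example.

```python
# Heights / cm from earlier in this worksheet, used here as example readings.
readings = [157, 160, 161, 164, 171, 172, 175, 176, 177, 182]


def max_deviation_error_bar(values):
    """Return (mean, lower end, upper end) of a maximum-deviation error bar."""
    mean = sum(values) / len(values)
    # Largest distance of any single data point from the mean.
    max_deviation = max(abs(value - mean) for value in values)
    return mean, mean - max_deviation, mean + max_deviation


mean, lower, upper = max_deviation_error_bar(readings)
print("Mean =", mean)
print("Error bar runs from", lower, "to", upper)
```

Because the bar is drawn the same distance above and below the mean, it is set by whichever single point lies furthest from the mean.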
Error bars can be constructed for each mean value:

If the error bars overlap then it cannot be concluded that the values are truly different. In biology we state that the values are not significantly different. If the error bars do not overlap then this suggests that the values are significantly different.

Standard deviation error bars are a more sophisticated indicator of the precision of a set of measurements. Standard deviation error bars are usually drawn for 1 standard deviation above and below the mean. If standard deviation is calculated for a set of data you will need a minimum of five repeats.

Task

Using your data from ‘How Fast Are You?’ for both you and your partner, plot a graph with standard deviation error bars and answer the questions below, followed by the question on soya beans:

1. Are the response times of your writing and your non-writing hand significantly different? Explain the reason for your answer.
2. Are your reaction times for your writing hand significantly different from those of your partner?

Soya beans

Yield of soya beans in plots at different altitudes in Zimbabwe. Yields are given as a mean and standard deviation.

Looking at the data, use the graph to describe how the yield varies with altitude. (Remember to make a general conclusion comparing the means and then use the error bars to discuss whether the data is precise enough to show significant differences.)

If you can see this, great: if the difference in the means is less than the standard deviation of one or both samples then they will not be significantly different. If you drew this onto a graph the error bars would overlap.

1.1.5 Deduce the significance of the difference between two sets of data using calculated values for t and the appropriate tables.

Student’s t-test is a statistical test.
One of the most common applications of statistics is to compare two sets of data, for example the heights of males and females in a class. These heights can be represented as a frequency histogram using the same x axis for both sets of data. If almost all the male students were taller than the female students then the two histograms would show very little overlap, as shown below in graph (a). From looking at this graph we would be confident in saying that the male students are taller than the female students.

Fig 1: Comparing two sets of data. The triangle indicates the mean value for each set of data.

As the overlap increases it becomes less certain that there is a difference. If the data looked like that shown in graph (b) above, where there is almost complete overlap, then we would be confident in saying that there is no difference in the height of male and female students.

It may appear from the graphs above that the difference between the mean values should be a sufficient measure of overlap, i.e. as the means become closer the overlap increases. However, the overlap between the two sets of data also depends on how closely the data are clustered around the two means. Look at the two graphs below:

You should notice that the difference between the means is the same. However, the data used to plot graph (b) is more variable, so there is more overlap and less certainty that there is a difference between the data.

The t-test is a technique which takes into account the means as well as the amount of overlap between two sets of data and says how certain we are that there is a significant difference.
The t-Test

Notation

x̄1 is the mean value for data set 1
Vertical lines indicate that the positive difference between the means should be taken, irrespective of which is bigger
S is the symbol for standard deviation
n is the number of measurements collected

** Note you will not be expected to remember this formula

What does the t-Test tell us?

It provides a way of measuring the overlap between two sets of data. If two sets of data have widely separated means and small variances (the data is clustered around the mean) they will have little overlap and a big value of t, so they can be shown to be significantly different. On the other hand, if two sets of data have means that are close together and large variances (the data is spread out from the mean) they will have a large overlap and a small value of t, so they can NOT be shown to be significantly different.

A large value of t indicates little overlap and a significant difference. A small value of t indicates a lot of overlap and no significant difference.

To judge whether the value of t is big or small you have to consult a table known as ‘A Table of Critical Values’. The value that should be looked at in the table depends on something known as ‘The Degrees of Freedom’.
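The sketch below shows one way the calculation could be carried out in Python. It assumes the usual form of the statistic that matches the notation listed above, t = |x̄1 - x̄2| / √(s1²/n1 + s2²/n2); that formula, and the function names, are assumptions made for this example. The data are the potato yields from the Excel example later in this worksheet, the degrees of freedom are calculated as (n1 - 1) + (n2 - 1), and the critical value of 2.10 is the entry for 18 degrees of freedom at p = 0.05 in the worksheet's table of critical values.

```python
import math
import statistics

# Potato yields / kg from the Excel example later in this worksheet.
fertiliser_a = [27, 20, 16, 18, 22, 19, 23, 21, 17, 19]
fertiliser_b = [28, 19, 18, 21, 24, 20, 25, 27, 29, 21]


def t_statistic(sample1, sample2):
    """Assumed formula: |mean1 - mean2| / sqrt(s1^2/n1 + s2^2/n2)."""
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.stdev divides by n - 1, like Excel's STDEV.
    s1, s2 = statistics.stdev(sample1), statistics.stdev(sample2)
    n1, n2 = len(sample1), len(sample2)
    return abs(mean1 - mean2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def degrees_of_freedom(sample1, sample2):
    """(n1 - 1) + (n2 - 1)."""
    return (len(sample1) - 1) + (len(sample2) - 1)


t = t_statistic(fertiliser_a, fertiliser_b)
df = degrees_of_freedom(fertiliser_a, fertiliser_b)
critical_value = 2.10  # p = 0.05, 18 degrees of freedom, from the table
significant = t >= critical_value

print("t =", round(t, 2), "with", df, "degrees of freedom")
print("Significantly different at p = 0.05?", significant)
```

Here t is smaller than the critical value, so the null hypothesis cannot be rejected. This agrees with the Excel result for the same data, where the probability of 7.88% is greater than 5%.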
An example of a part of a ‘Table of Critical Values’ is shown below:

Degrees of    Significance levels
Freedom       p = 0.05    p = 0.01
15            2.13        2.95
16            2.12        2.92
17            2.11        2.90
18            2.10        2.88
19            2.09        2.86
20            2.09        2.85
21            2.08        2.83
22            2.07        2.82
23            2.07        2.81
24            2.06        2.80
25            2.06        2.79
30            2.04        2.75
40            2.02        2.70
60            2.00        2.66

To work out the degrees of freedom:

Degrees of freedom = (n1 - 1) + (n2 - 1), where n1 and n2 are the numbers of measurements in the two samples

So if there were 21 individuals in each sample then the degrees of freedom would equal:

Degrees of freedom = (21 - 1) + (21 - 1)
Degrees of freedom = 40

Imagine carrying out a t-test to compare two sets of data with 21 measurements in each set, giving a calculated value of t = 3.42. Looking at the table, the critical value for t at the 0.05 level (biologists usually use this level) with 40 degrees of freedom is 2.02. If there were really no difference between the two sets of data, the probability of getting a value of t of 2.02 or larger by chance would be 0.05 (5%). Since 3.42 is larger than 2.02, the probability that the difference arose by chance is less than 5%, so it is unlikely to have arisen by chance. Therefore the two sets of data are significantly different. In fact 3.42 is also bigger than the critical value at the 0.01 level (2.70), which means that the probability of getting a value of t this large by chance is less than 0.01 (1%).

In investigations that will be analysed using statistical tests, scientists usually make a null hypothesis. The null hypothesis usually states that there is no significant difference between two samples. If a value of t is greater than or equal to the critical value then the null hypothesis can be rejected and it can be stated that there is a significant difference.

Questions

Below is some data obtained from Open University students, who measured the lengths of leaves in 3 day germinated wheat seedlings that had been given different treatments. Batch A were grown from normal seeds and batch B from seeds that had been subjected to gamma radiation.
                               Normal, batch A   Gamma irradiated, batch B
x̄  mean leaf length / mm       10.9              2.3
S  standard deviation / mm     3.97              1.52
n  sample size                 15                15

Calculate the value of t using the equation. Show your working.

How many degrees of freedom are there for this test?

Work out the value of t.

State a null hypothesis for this experiment and state whether it can be accepted or rejected, with reasons:

A market gardener was testing the effectiveness of plastic plant pots over clay pots. He used seed from a pure inbred line, so all seeds were the same genotype. He grew 10 plants in plastic pots and 10 plants in clay pots and observed how long it took before each reached a flowering stage suitable for sale. Below are the results:

                         A (Clay)   B (Plastic)
Number (n)               10         10
Mean time / days         95         100
Standard deviation (S)   3.2        4.6

State a null hypothesis:

Calculate a value of t and compare it with the values in the table of critical values above at the 5% probability level and the correct degrees of freedom. Make a comment about whether you would reject or accept the null hypothesis:

Using Excel to calculate values of t

Yield of potatoes / kg

Plot       Fertiliser A   Fertiliser B
1          27             28
2          20             19
3          16             18
4          18             21
5          22             24
6          19             20
7          23             25
8          21             27
9          17             29
10         19             21
mean       20.2           23.2           =AVERAGE(C3:C12)
t-test P   7.88%                         =TTEST(B3:B12, C3:C12, 2, 2) (also format the cell for %)
st dev     3.22           3.94           =STDEV(B3:B12)

In Excel the t-test function is given by =TTEST(range1, range2, tails, type).

In Biology assume a two tailed test, so tails = 2. For type, type 1 refers to comparing paired data from the same individuals and type 2 is used when data is compared between different individuals. For example, a different set of potatoes was grown with fertilizer A than with fertilizer B, therefore it is a type 2 test.
If you compared the mean heart rates of all the members of your class before and after they had drunk a cup of coffee it would be a type 1 test, because you are looking at differences within the same individuals.

What is good about using Excel?

When Excel calculates t it gives you a percentage. You do not have to consult a table of critical values. The percentage tells us the probability that a difference as large as the one observed between the two sets of data could have arisen by chance alone. Remember that biologists generally work to a 5% rule, so there has to be a 5% or smaller probability that the difference is due to chance before a biologist states that the two sets of data are significantly different.

In the example above, the yields of potatoes treated with fertilizer A and with fertilizer B are compared. It can be seen that fertilizer B delivers a larger mean yield, but the t-test P shows that there is about an 8% probability that a difference this large could arise by chance. Since this is more than 5% we must conclude that the yields with fertilizer A and fertilizer B are not significantly different.

An experiment was done to measure the pulse rates of 8 individuals before and after a large meal. The data is shown below:

           Pulse rate (bpm)
Subject    Before eating   After eating
1          105             109
2          79              87
3          79              86
4          103             109
5          87              90
6          74              78
7          73              78
8          82              89

Using Excel, perform a t-test and plot a graph. Check your results with your teacher and then print out and keep your results.

Myth: Girls Can’t Catch

Is there really a difference between boys’ and girls’ catching abilities?

State a null hypothesis for this:

Design a method for collecting data that can be tested using the t-test.

Independent variable:
Dependent variable:
Control variables:
Method

Collect your results and analyse them using a t-test. Can you bust the myth?

1.1.6 Explain that the existence of a correlation does not establish that there is a causal relationship between two variables.
Correlation is a statistical method that answers the question ‘Are these two variables associated?’. In other words, if one variable changes, does the other change too?

All living organisms respire and most need oxygen to do this. There are many factors which affect the rate of oxygen consumption. One of these is temperature. Different organisms consume different volumes of oxygen at different temperatures. Biologists studying this will want to know if there is an association between oxygen consumption and temperature. The following graphs show scatter diagrams for an insect, the Colorado beetle, and a chipmunk, which is a small, squirrel-like mammal.

Two different types of association are shown here. With the Colorado beetle, there is a positive association. In other words, as the temperature increases, so does the rate of respiration. A line of best fit slopes upwards. The scatter graph for the chipmunk, on the other hand, shows a negative association. As the temperature increases, oxygen consumption decreases and the line of best fit slopes downwards.

If there is no association then when a scatter graph is plotted the points will be distributed randomly over the graph and it would be extremely difficult to draw a line of best fit.

Causal Relationships or Not?

Although drawing a scatter graph can enable you to see if there is a relationship between variables, it does not prove that ‘x causes y’. The graph below shows a positive relationship between ice-cream sales and cases of sunburn. The greater the number of ice-creams sold, the greater the number of sunburn cases. It may seem obvious that ice cream does not cause sunburn, but as scientists we have to be aware that a relationship between two variables does not mean that one thing causes the other.
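The direction of an association can also be checked numerically by finding the slope of a line of best fit. The sketch below uses a least-squares line, which is one common way of fitting it and is an assumption here, since the worksheet does not say how the line should be drawn. The temperature and oxygen values are invented purely for illustration and are not measurements from the beetle or chipmunk graphs; the function names are also hypothetical.

```python
def best_fit_slope(xs, ys):
    """Slope of the least-squares line of best fit through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def describe_association(slope):
    """Upward slope means positive association, downward means negative."""
    if slope > 0:
        return "positive"
    if slope < 0:
        return "negative"
    return "none"


# Hypothetical values, invented for this example only.
temperature = [5, 10, 15, 20, 25]
rising_oxygen = [1.0, 1.8, 3.1, 4.2, 5.0]   # increases with temperature
falling_oxygen = [5.2, 4.1, 3.3, 2.4, 1.6]  # decreases with temperature

print(describe_association(best_fit_slope(temperature, rising_oxygen)))
print(describe_association(best_fit_slope(temperature, falling_oxygen)))
```

A positive or negative slope only describes the association. It says nothing about whether one variable causes the other.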
Some examples of questionable correlations:

Since the 1950s, both the atmospheric CO2 level and crime levels have increased sharply. Hence, atmospheric CO2 causes crime.

The above example arguably makes the mistake of prematurely concluding a causal relationship where the relationship between the variables, if any, is so complex it may be labeled coincidental. The two events have no simple relationship to each other besides the fact that they are occurring at the same time.

Scientific research finds that people who use cannabis (A) have a higher prevalence of psychiatric disorders (B) than those who do not.

This particular correlation is sometimes used to support the theory that the use of cannabis causes a psychiatric disorder (A is the cause of B). Although this may be possible, we cannot automatically discern a cause and effect relationship from research that has only determined that people who use cannabis are more likely to develop a psychiatric disorder. From the same research, it can also be the case that:

(1.) having the predisposition for a psychiatric disorder causes these individuals to use cannabis (B causes A), OR
(2.) some unknown third factor (e.g., poverty) is the actual cause of the higher number of people (compared to the general public) who both use cannabis and have been diagnosed as having a psychiatric disorder.

Alternatively, it may be that the effects of cannabis are found more pleasurable by persons with certain psychiatric disorders. To assume that A causes B is tempting, but further scientific investigation of the type that can isolate extraneous variables is needed when research has only determined a statistical correlation.

http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation