Topic 1 Statistical analysis

IB Biology
Keywords
Arithmetic mean
Causal
Relationship
Significance
t-test
Value
Correlation
Spread
Variability
Error bar
Standard deviation
Variable
1.1.1 State that error bars are a graphical representation of the variability of data.
1.1.2 Calculate the mean and standard deviation (SD) of a set of values.
1.1.3 State that the term standard deviation is used to summarize the spread of values around the mean, and that 68% of the values fall within one standard deviation of the mean.
1.1.4 Explain how the standard deviation is useful for comparing the means and the spread of data between two or more samples.
Describing variation mathematically:
Living things can vary so that even two peas in a pod show a variety of sizes and shapes. This raises a number
of questions. How can we describe the range of variation? Which pea size is the most common? Can we sort
the peas into groups to decide if they came from the same or different pods? Biologists ask these types of
questions not only about living organisms but also about sets of data from experiments.
The Arithmetic Mean:
A group of ten students had their shoe sizes recorded. The results are listed here:
Group A: 5, 6, 8, 7, 8, 6, 7, 7, 9, 7
The arithmetic mean is the total divided by the number of results, so:
Total = 70
Number of results = 10
Mean = 70 / 10 = 7
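The same calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the variable names are chosen here and are not part of the original worksheet.

```python
from statistics import mean

# Shoe sizes for Group A, copied from the list above.
group_a = [5, 6, 8, 7, 8, 6, 7, 7, 9, 7]

total = sum(group_a)       # add up all the results
count = len(group_a)       # number of results
by_hand = total / count    # total divided by the number of results

print(total, count, by_hand)  # 70 10 7.0
print(mean(group_a))          # the library function gives the same mean
```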
If you were a shoe manufacturer you would find this useful, as you would
know that it would be a good idea to make plenty of shoes at size 7.
However, you would not know how wide the variation is around the mean.
All of the distributions below have means of about 7, but they clearly
need very different outputs from the shoe factory.
Group A: 5, 6, 8, 7, 8, 6, 7, 7, 9, 7
Group B: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
Group C: 5, 6, 6, 6, 7, 7, 7, 7, 8, 9
Group D: 5, 5, 5, 5, 8, 8, 8, 8, 9, 9
Most people if asked to summarise each set of data above would probably come up with the idea of using the mean. If asked
what other information would be useful, can you think of anything?
You may have suggested that a measure of the ‘spread’ of data would be useful. A very simple way to do this is to simply
record the range. Can you complete the information below to describe each set of data:
Group   Mean   ‘Spread’ of Data
A       7      Data ranges from 5 to 9
B       7
C       7
D       7
Can you see any problems in only using this to describe a group of data?
Standard Deviation
• The standard deviation of a set of data is calculated from the deviation of each measurement from the mean.
• Which group is very tightly packed around 7 – a shoe manufacturer's dream? _____________
• This group should have the smallest standard deviation.
• If data are clustered around the mean you would expect lots of small deviations from the mean. If the data are more spread out you would expect the deviations to be bigger.
• Continuous data show a smooth transition of values across a spectrum. Weight and height are good examples (a count, such as the number of plants in a particular area, is discrete rather than continuous). To describe the spread of results in continuous data, biologists use a statistic called the standard deviation.
Adapted with permission
from Sha Tin College-1/12
Calculating Standard Deviation (What are the steps involved and what does it mean?)
• The heights of the students in the group checked for shoe size were recorded. The data show a typical distribution and the mean can easily be calculated.
Heights: 157, 160, 161, 164, 171, 172, 175, 176, 177, 182
• Work out the total and the mean for this set of data:

• The individual members of the group are different from the mean. These differences can be calculated.
Height                 157  160  161  164  171  172  175  176  177  182
Difference from mean
Since some of the individuals fall below the average some of the differences will be negative. To convert all these values
into positive numbers they are squared.
Height                 157  160  161  164  171  172  175  176  177  182
Squared differences
• The figures above give a measure of the deviation of the individuals from the mean. The standard deviation is based on the mean of these squared deviations, so you will need to find the mean of these values:
Total = 642.5 (explain what numbers were used to calculate this)
Mean = 64.25 (explain how this number was obtained)
• Since this is the mean of the squares of the original deviations, we take the square root of the mean and call it the standard deviation. Standard deviation = √64.25 = 8.02
• The standard deviation is a useful way to describe the variability in a set of continuous data. The larger the standard deviation, the larger the spread of data around the mean.
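The steps above can be followed line by line in a short Python sketch. It uses the heights from this section and divides by the number of values, as the worksheet does. The variable names are chosen here for illustration only.

```python
from math import sqrt
from statistics import pstdev

heights = [157, 160, 161, 164, 171, 172, 175, 176, 177, 182]

total = sum(heights)                         # 1695
mean = total / len(heights)                  # 169.5
differences = [h - mean for h in heights]    # some of these are negative
squared = [d ** 2 for d in differences]      # squaring makes them all positive
total_squared = sum(squared)                 # 642.5
mean_squared = total_squared / len(heights)  # 64.25
sd = sqrt(mean_squared)                      # square root of the mean square

print(round(sd, 2))               # 8.02
print(round(pstdev(heights), 2))  # the library function agrees
```

Note that some calculators and spreadsheet functions divide by n - 1 rather than n (Python's `statistics.stdev` does this), which gives a slightly larger value for the same data.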
Questions
The table below shows the heights of two groups of IB Biology students.
Group A heights / cm: 180, 176, 160, 169, 172, 178, 182, 177, 175
Group B heights / cm: 180, 177, 163, 166, 175, 177, 180, 179, 173, 169
a. Calculate the mean for each set of students.
b. Calculate the standard deviation for each set of students.
Heights, Group A        180  176  160  169  172  178  182  177  175
Difference from mean
Square of differences
Heights, Group B        180  177  163  166  175  177  180  179  173  169
Difference from mean
Square of differences
Total of squared differences for group A:
Total of squared differences for group B:
Standard deviation for group A = √____ = ____
Standard deviation for group B = √____ = ____
The graph below shows three different groups of data:
Using your Graphical Calculator to Calculate Standard Deviation
1. Turn the calculator on.
2. Press the stat key and press enter.
3. Input the data into L1 (use the data from side 6 about heights).
4. After all the data has been entered, press the stat key again and shift across to select CALC and press enter.
5. Press the 2nd key followed by L1 and enter. This will give information about your set of data.
Work out the following (the values on the left are for the heights data):
x̄ = 169.5        x̄ =
Σx = 1695        Σx =
Σx² = 287945     Σx² =
σx = 8.02        σx =
n = 10           n =
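As a cross-check on the calculator output, the same summary values can be worked out in Python from the heights data. This is a minimal sketch. The shortcut used for σx (the square root of the mean of the squares minus the square of the mean) is a standard identity and is not taken from the worksheet.

```python
from math import sqrt

heights = [157, 160, 161, 164, 171, 172, 175, 176, 177, 182]

n = len(heights)                         # n
sum_x = sum(heights)                     # Σx
sum_x2 = sum(h ** 2 for h in heights)    # Σx²
x_bar = sum_x / n                        # x̄
sigma_x = sqrt(sum_x2 / n - x_bar ** 2)  # σx, dividing by n

print(n, sum_x, sum_x2, x_bar, round(sigma_x, 2))  # 10 1695 287945 169.5 8.02
```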
How Fast Are You?
How fast are you? How do you compare to your classmates? Human reaction time can be measured using a ruler and the
‘Drop-Catch’ method with a partner.
Procedure
1. Have your partner brace his or her writing hand on the edge of a desk or table, with the fingers and thumb extending over the edge. Hold the ruler above your partner’s hand so that the “0” line is level with the top of the thumb, as shown in the figure. The ruler should be able to slide easily between your partner’s thumb and index finger.
2. Drop the ruler so that it falls straight down between your partner’s thumb and index finger. Your partner should grab the ruler as quickly as possible. Read the number on the ruler just above your partner’s thumb and index finger. This is the distance the ruler fell before your partner caught it. Record this number in your data table.
3. Repeat steps 1 and 2 four more times. You should have a total of five measurements in your data table.
4. Have your partner switch hands so that he or she is catching the ruler with the non-writing hand. Repeat steps 1 through 3. You should now have a total of 10 measurements for your partner: five for the writing hand and five for the non-writing hand. Don’t forget to record them in the data table.
5. Switch places with your partner and repeat the whole exercise.
DATA TABLE – To Show Rates in Myself and My Partner When Catching a Ruler
                     Your Partner                           You
Trial                Writing hand     Non-writing hand     Writing hand     Non-writing hand
                     distance / cm    distance / cm        distance / cm    distance / cm
1
2
3
4
5
Average
Standard Deviation
Questions:
Which person had the faster reaction time with the writing hand, you or your partner?
Was your average reaction time when you used your writing hand different from when you used your non-writing hand?
Was your partner’s?
What does the standard deviation tell us?
• The standard deviation, as mentioned before, is a measure of the variability of a set of data or, to be more precise, of its spread around the mean. For normally distributed data, about 68% of all values lie within the range of the mean plus or minus one standard deviation (i.e. x̄ ± 1s). About 95% of all values lie within the range of plus or minus 2 standard deviations (i.e. x̄ ± 2s).
• From your data from ‘How Fast Are You?’: if you were to get 1000 more readings for your writing hand, theoretically...
• Between what numbers would 68% of your values lie?
• Between what numbers would 95% of your values lie?
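The 68% and 95% figures come from the normal distribution, and a short Python sketch can show where they come from and how to turn a mean and standard deviation into a range. The catch distances used in the example are made up for illustration. Substitute your own mean and standard deviation.

```python
from statistics import NormalDist

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    standard = NormalDist()  # mean 0, standard deviation 1
    return standard.cdf(k) - standard.cdf(-k)

def sd_range(mean, sd, k):
    """Lower and upper limits of mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd

print(round(fraction_within(1), 3))  # 0.683, i.e. about 68%
print(round(fraction_within(2), 3))  # 0.954, i.e. about 95%

# Hypothetical catch distances: mean 15.0 cm, standard deviation 2.0 cm
print(sd_range(15.0, 2.0, 1))  # (13.0, 17.0)
print(sd_range(15.0, 2.0, 2))  # (11.0, 19.0)
```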
Questions:
1. A fish farmer sells 10 000 trout in a year (mean mass = 400 g and s.d. = 25 g).
a. Assuming a normal distribution, estimate the number of these that would be in the range of the mean ± 1 s.d.
b. Estimate the number that would have a mass greater than the mean + 1 s.d.
2. The pulse rates of 2400 patients were recorded and it was calculated that the mean value was 74 beats per minute with a s.d. of 6 beats per minute. What percentage of the patients had a pulse rate in the range 68-80 beats per minute?
How do we show variability in our data when we graph it?
• When you carry out an experiment and want reliable results, you usually take repeat measurements and then take the mean of those measurements.
• Sometimes we need to be able to compare two measurements.
• But can we be sure that our two means are really different? Perhaps our data is not precise enough?
• One way to take into account the variability in the results, and hence their level of precision, is to draw error bars.
• A simple way to construct an error bar is to use the maximum deviation of a single data point away from the mean.
• When drawing a graph, an error bar is drawn above and below the mean to show the maximum deviation away from the mean.
• Error bars can be constructed for each mean value:
• If the error bars overlap then it cannot be concluded that the values are truly different. In biology we state that the values are not significantly different.
• If the error bars do not overlap then a conclusion that they are significantly different is justified.
• Standard deviation error bars are a more sophisticated indicator of the precision of a set of measurements. Standard deviation error bars are usually drawn for 1 standard deviation above and below the mean.
• If standard deviation is to be calculated for a set of data you will need a minimum of five repeats.
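The two kinds of error bar described above, and the overlap check, can be sketched in Python. The measurements below are made up for illustration, and the function names are chosen here rather than taken from the worksheet. The standard deviation divides by n, to match the hand calculation used earlier.

```python
from statistics import mean, pstdev

def error_bar(values, use_sd=True):
    """Return (mean, half_length) for an error bar drawn above and below the mean."""
    centre = mean(values)
    if use_sd:
        half = pstdev(values)                        # 1 standard deviation
    else:
        half = max(abs(v - centre) for v in values)  # maximum deviation
    return centre, half

def bars_overlap(bar_1, bar_2):
    """True if the two error bars share any part of the axis."""
    (m1, h1), (m2, h2) = bar_1, bar_2
    return abs(m1 - m2) <= h1 + h2

# Hypothetical catch distances in cm, five repeats each (made up for illustration)
writing = [12.0, 14.0, 13.0, 15.0, 11.0]
non_writing = [18.0, 20.0, 17.0, 21.0, 19.0]

print(error_bar(writing, use_sd=False))                          # (13.0, 2.0)
print(bars_overlap(error_bar(writing), error_bar(non_writing)))  # False
```

With these made-up numbers the bars do not overlap, so by the rule above the two means would be described as significantly different.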
Task
Using your data from ‘How Fast Are You?’ for both you and your partner, plot a graph with standard deviation error bars and answer the questions below, followed by the question on soya beans:
1. Are the response times of your writing and your non-writing hand significantly different? Explain the reason for your answer.
2. Are your reaction times for your writing hand significantly different from those of your partner?
Yield of soya beans in plots at different altitudes in Zimbabwe. Yields are given as a mean and standard deviation.
Looking at the data, use the graph to describe how the yield varies with altitude. (Remember to make a general conclusion comparing the means and then use the error bars to discuss whether the data is precise enough to offer significant differences.)
If you can see this, great...
If the difference in the means is less than the standard deviation of one or both samples then they will not be significantly different. If you drew this onto a graph the error bars would overlap!
1.1.5 Deduce the significance of the difference between two sets of data using calculated values for t and the appropriate tables.
Student's t-test is a statistical test. One of the most common applications of statistics is to compare two sets of data, for example the heights of males and females in a class. These heights can be represented as frequency histograms using the same x axis for both sets of data.
If almost all the male students were taller than the female students then the two histograms would show very little overlap, as
shown below in graph (a). From looking at this graph we would be confident in saying that the male students are taller than
the female students.
Fig 1: Comparing two sets of data. The triangle indicates the mean value for each set of data.
• As the overlap increases it becomes less certain that there is a difference. If the data looked like that shown in graph (b) above, where there is almost complete overlap, then we could not be confident that there is any difference in the height of male and female students.
• It may appear from the graphs above that the difference between the mean values should be a sufficient measure of overlap, i.e. as the means become closer the overlap increases. However, the overlap between the two sets of data also depends on how closely the data are clustered around the two means.
Look at the two graphs below:
• You should notice that the difference between the means is the same.
• However, the data used to plot graph (b) is more variable, so there is more overlap and less certainty that there is a difference between the data.
• The t-test is a technique which takes into account the means as well as the amount of overlap between two sets of data and says how certain we can be that there is a significant difference.
The t-Test
Notation
• x̄₁ is the mean value for data set 1
• Vertical lines indicate that the positive difference between the means should be taken, irrespective of which is bigger
• S is the symbol for standard deviation
• n is the number of measurements collected
** Note: you will not be expected to remember this formula
What does the t-Test tell us?
It provides a way of measuring the overlap between two sets of data.
If two sets of data have widely separated means and small variances (the data is clustered around the mean), they will have little overlap and a big value of t, and they can be shown to be significantly different.
On the other hand, if two sets of data have means that are close together and large variances (the data is spread from the mean), they will have a large overlap and a small value of t, and they can NOT be shown to be significantly different.
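The formula itself is not reproduced in this text, so the sketch below assumes the commonly taught form for two independent samples, t = |x̄₁ − x̄₂| / √(S₁²/n₁ + S₂²/n₂), which matches the notation listed above. Treat that formula as an assumption, not as the worksheet's own. The numbers are made up to show how separation and spread affect t.

```python
from math import sqrt

def t_value(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """t for two independent samples (assumed form, see the note above)."""
    difference = abs(mean_1 - mean_2)                 # positive difference between means
    spread = sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)  # combines the spread of both samples
    return difference / spread

# Made-up examples: widely separated means with small spread ...
far_apart_tight = t_value(10.0, 1.0, 16, 14.0, 1.0, 16)
# ... versus close means with large spread
close_together_wide = t_value(10.0, 4.0, 16, 11.0, 4.0, 16)

print(round(far_apart_tight, 2))      # 11.31 (big t, little overlap)
print(round(close_together_wide, 2))  # 0.71 (small t, lots of overlap)
```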
A large value of t indicates little overlap and a significant difference.
A small value of t indicates a lot of overlap and no significant difference.
To judge whether the value of t is big or small you have to consult a table known as ‘A Table of Critical Values’. The value
that should be looked at in the table depends on something known as ‘The Degrees of Freedom’. An example of a part of a
‘Table of Critical Values’ is shown below:
                      Significance levels
Degrees of Freedom    p = 0.05    p = 0.01
15                    2.13        2.95
16                    2.12        2.92
17                    2.11        2.90
18                    2.10        2.88
19                    2.09        2.86
20                    2.09        2.85
21                    2.08        2.83
22                    2.07        2.82
23                    2.07        2.81
24                    2.06        2.80
25                    2.06        2.79
30                    2.04        2.75
40                    2.02        2.70
60                    2.00        2.66
• To work out the degrees of freedom for a t-test comparing two samples:
Degrees of freedom = (n₁ - 1) + (n₂ - 1), where n₁ and n₂ are the numbers of measurements in the two samples
• So if there were 21 individuals in each sample then the degrees of freedom would equal:
Degrees of freedom = (21-1) + (21-1)
Degrees of freedom = 40
• Imagine carrying out a t-test to compare two sets of data with 21 measurements in each set, giving a calculated value of t = 3.42.
• Looking at the table, the critical value for t at the 0.05 level (biologists usually look at this level) with 40 degrees of freedom is 2.02. This means that, if there were no real difference, the probability of getting a value of t of 2.02 or larger by chance would be 0.05 (5%). Since 3.42 is larger than 2.02, the probability that the difference between the two sets of data arose by chance is less than 5%. Therefore the two sets of data are significantly different. In fact 3.42 is also bigger than the critical value at the 0.01 level (2.70), which means that the probability of getting a value of t this large by chance is less than 0.01 (1%).
• In investigations that will be analysed using statistical tests, scientists usually make a null hypothesis. The null hypothesis usually states that there is no significant difference between two samples.
• If a value of t is greater than or equal to the critical value then the null hypothesis can be rejected and it can be stated that there is a significant difference.
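The decision rule can be written as a small Python sketch using the worked example above (21 measurements in each sample, t = 3.42). The function names are chosen here for illustration. The critical value is read from the table by hand and passed in.

```python
def degrees_of_freedom(n_1, n_2):
    """Degrees of freedom for a t-test comparing two samples."""
    return (n_1 - 1) + (n_2 - 1)

def reject_null(t, critical_value):
    """The null hypothesis is rejected when t reaches or exceeds the critical value."""
    return t >= critical_value

df = degrees_of_freedom(21, 21)
print(df)  # 40

# 2.70 is the critical value for 40 degrees of freedom at p = 0.01 in the table
print(reject_null(3.42, 2.70))  # True: significant even at the 1% level
```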
Questions
• Below is some data obtained by Open University students, who measured the lengths of leaves in 3-day germinated wheat seedlings that had been given different treatments. Batch A were grown from normal seeds and batch B from seeds that had been subjected to gamma radiation.
                                Normal, batch A    Gamma irradiated, batch B
x̄   mean leaf length / mm       10.9               2.3
S   standard deviation / mm     3.97               1.52
n   sample size                 15                 15
Calculate the value of t using the equation. Show your work.
• How many degrees of freedom are there for this test?
• Work out the value of t.
• State a null hypothesis for this experiment and state whether it can be accepted or rejected, with reasons:
• A market gardener was testing the effectiveness of plastic plant pots compared with clay pots. He used seed from a pure inbred line, so all seeds were the same genotype. He grew 10 plants in plastic pots and 10 plants in clay pots and observed how long it took before each reached a flowering stage suitable for sale. Below are the results:
                          A – Clay    B – Plastic
Number (n)                10          10
Mean time / days          95          100
Standard deviation (S)    3.2         4.6
• State a null hypothesis:
• Calculate a value of t and compare it with the values in the table of critical values at the 5% probability level and the correct degrees of freedom.
• Make a comment about whether you would reject or accept the null hypothesis:
Using Excel to calculate values of t.
Yield of potatoes (kg)
Plot        Fertiliser A    Fertiliser B
1           27              28
2           20              19
3           16              18
4           18              21
5           22              24
6           19              20
7           23              25
8           21              27
9           17              29
10          19              21
mean        20.2            23.2
st dev      3.22            3.94
t-test P    7.88%
Formulas used:
t-test P: =TTEST(B3:B12, C3:C12, 2, 2) (also format the cell for %)
mean: e.g. =AVERAGE(C3:C12)
st dev: e.g. =STDEV(B3:B12)
In Excel the t-test function is given by =TTEST(range1, range2, tails, type).
In Biology assume a two-tailed test (tails = 2). For type, type 1 (a paired test) is used when comparing data from the same individuals and type 2 when data is compared between different individuals. For example, different sets of potatoes were treated with fertilizers A and B, therefore it is a type 2 test. If you compared the mean heart rates of all the members of your class before and after they had drunk a cup of coffee it would be a type 1 test, because you are looking at differences in the same individuals.
What is good about using Excel?
• When Excel carries out the t-test it gives you a probability, which can be formatted as a percentage. You do not have to consult a table of critical values. The percentage tells us the probability that a difference as large as the one between these two sets of data could arise by chance alone. Remember, biologists generally work to a 5% rule, so there has to be a 5% or less chance that the difference arose by chance before a biologist states that the two sets of data are significantly different!
• In the above example the yield of potatoes was compared for plots treated with fertilizer A and plots treated with fertilizer B. It can be seen that fertilizer B delivers a larger mean yield, but the t-test P shows that there is about an 8% probability that a difference this large arose by chance. Since this is more than 5%, we must conclude that the yields with fertilizers A and B are not significantly different.
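The spreadsheet figures can be checked with a short Python sketch using the potato yields above. It works out t with the pooled, equal-variance method that Excel's type 2 test uses, then compares it with the critical value from the table (2.10 for 18 degrees of freedom at p = 0.05) rather than computing the probability itself. The variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import mean, stdev

fertiliser_a = [27, 20, 16, 18, 22, 19, 23, 21, 17, 19]
fertiliser_b = [28, 19, 18, 21, 24, 20, 25, 27, 29, 21]

n_a, n_b = len(fertiliser_a), len(fertiliser_b)
mean_a, mean_b = mean(fertiliser_a), mean(fertiliser_b)
sd_a, sd_b = stdev(fertiliser_a), stdev(fertiliser_b)  # divides by n - 1, like STDEV

# Pooled (equal-variance) t for two independent samples
df = (n_a - 1) + (n_b - 1)
pooled_variance = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
t = abs(mean_a - mean_b) / sqrt(pooled_variance * (1 / n_a + 1 / n_b))

print(round(mean_a, 1), round(mean_b, 1))  # 20.2 23.2
print(round(sd_a, 2), round(sd_b, 2))      # 3.22 3.94
print(df, round(t, 2))                     # 18 1.86
print(t >= 2.10)                           # False: not significant at the 5% level
```

A t of 1.86 falls below the critical value of 2.10, which agrees with the spreadsheet probability of 7.88% being greater than 5%.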
An experiment was done to measure the pulse rates of 8 individuals before and after a large meal. The data is shown below:
Pulse rate (bpm)
Subject    Before eating    After eating
1          105              109
2          79               87
3          79               86
4          103              109
5          87               90
6          74               78
7          73               78
8          82               89
Using excel perform a t-test and plot a graph.
Check your results with your teacher and then print out and keep your results.
Myth: Girls Can’t Catch
• Is there really a difference between boys’ and girls’ catching abilities?
• State a null hypothesis for this:
• Design a method for collecting data that can be tested using the t-test.
Independent variable:
Dependent Variable:
Control variables:
Method
Collect your results and analyse them using a t-test. Can you bust the myth?
1.1.6 Explain that the existence of a correlation does not establish that there is a causal relationship between two variables.
Correlation is a statistical method that answers the question ‘Are these two variables associated?’. In other words, if one
variable changes, does the other change too? All living organisms respire and most need oxygen to do this. There are many
factors which affect the rate of oxygen consumption. One of these is temperature. Different organisms consume different
volumes of oxygen at different temperatures. Biologists studying this will want to know if there is an association between
oxygen consumption and temperature. The following graphs show scatter diagrams for an insect, the Colorado beetle, and a
chipmunk which is a small, squirrel-like mammal.
Two different types of association are shown here. With the
Colorado beetle, there is a positive association. In other words, as
the temperature increases, so does the rate of respiration. A line of
best fit slopes upwards. The scatter graph for the chipmunk, on the
other hand, shows a negative association. As the temperature
increases, oxygen consumption decreases and the line of best fit slopes
downwards. If there is no association then when a scatter graph is
plotted the points will be distributed randomly over the graph and it
would be extremely difficult to draw a line of best fit.
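One common way to put a number on an association is the correlation coefficient r, which is close to +1 for a strong positive association, close to -1 for a strong negative one, and near 0 when there is none. The worksheet does not name this statistic, so the sketch below is an added illustration, and the temperature and oxygen readings are made up.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up readings for illustration only
temperature = [5, 10, 15, 20, 25, 30]
rising = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]   # oxygen use goes up with temperature
falling = [6.2, 5.0, 4.1, 2.9, 2.2, 0.9]  # oxygen use goes down with temperature

print(pearson_r(temperature, rising) > 0)   # True: positive association
print(pearson_r(temperature, falling) < 0)  # True: negative association
```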
Causal Relationships or Not?
Although drawing a scatter graph can enable you to see if there is a
relationship between variables, it does not prove that ‘x causes y’.
The graph below shows a positive relationship between ice-cream
sales and cases of sunburn. The greater the number of ice-creams
sold, the greater the number of sunburn cases.
It may seem obvious that ice cream does not cause sunburn, but as scientists we have to be aware that a relationship between
two variables does not mean that one thing causes the other.
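A small simulation can show how such a relationship might arise without either variable causing the other. It assumes, purely for illustration, that hot weather is a third factor driving both ice-cream sales and sunburn. All numbers are invented, and the correlation coefficient r is an added statistic not named in the worksheet.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumed third factor: daily temperature (invented values)
temperatures = [rng.uniform(15, 35) for _ in range(200)]

# Both quantities depend on temperature only, never on each other
ice_creams = [10 * t + rng.gauss(0, 10) for t in temperatures]
sunburn = [2 * t + rng.gauss(0, 3) for t in temperatures]

# Strong correlation even though neither one causes the other
print(pearson_r(ice_creams, sunburn) > 0.8)
```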
Some examples of questionable correlations:
Since the 1950s, both the atmospheric CO2 level and crime levels have increased sharply.
Hence, atmospheric CO2 causes crime.
The above example arguably makes the mistake of prematurely concluding a causal relationship where the relationship
between the variables, if any, is so complex it may be labeled coincidental. The two events have no simple relationship to
each other beside the fact that they are occurring at the same time.
Scientific research finds that people who use cannabis (A) have a higher prevalence of psychiatric disorders (B)
compared to those who do not.
This particular correlation is sometimes used to support the theory that the use of cannabis causes a psychiatric disorder (A is
the cause of B). Although this may be possible, we cannot automatically discern a cause and effect relationship from
research that has only determined people who use cannabis are more likely to develop a psychiatric disorder. From the same
research, it can also be the case that (1.) having the predisposition for a psychiatric disorder causes these individuals to use
cannabis (B causes A), OR (2.) it may be the case that in the above study some unknown third factor (e.g., poverty) is the
actual cause for there being found a higher number of people (compared to the general public) who both use cannabis and
who have been diagnosed as having a psychiatric disorder. Alternatively, it may be that the effects of cannabis are found
more pleasurable by persons with certain psychiatric disorders. To assume that A causes B is tempting, but further scientific
investigation of the type that can isolate extraneous variables is needed when research has only determined a statistical
correlation.
http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation