STATISTICAL ANALYSIS FOR
BEHAVIORAL ECOLOGY
INTRODUCTION
Statistics are a necessary aspect of scientific studies, and behavioral ecology is no
exception. This includes not only summary statistics such as averages, but also hypothesis
testing, examining relationships among data, and making sound predictions. Statistics help the
scientist maintain objectivity when interpreting data and provide a solid basis for sound
comparisons among data sets. In so doing, statistical analyses aid scientists in making
discoveries that might otherwise have been obscured by investigator bias and preconceived ideas.
The following guide is intended to be a useful synopsis of some basic statistical
procedures as they may be applied to data collected in behavioral studies. Become familiar with
these procedures, because a statistical test is only as good as its match to the type of data and
the experimental setup. As you go through this exercise, pay attention to the experimental
design (e.g., treatment applications with controls, comparing two data sets versus more than two,
or examining the relatedness of two data sets), the type of data (e.g., counts versus measures),
and the statistical analysis used.
Because of the widespread availability of Microsoft Excel, the statistical procedures
presented here use Excel to perform the various analyses. Other packages will certainly perform
the analyses as well as, or in some cases better than, Excel and are acceptable alternatives.
For anything beyond basic summary statistics, you will need to follow the steps below to enable
the more advanced analysis tools available in Excel.
1. Click on “Tools” and then “Add-Ins”.
2. Check the boxes beside “Analysis ToolPak” and “Analysis ToolPak-VBA” and click “OK” (if a box is already checked, continue to the next step).
3. Open “Tools” again and open “Data Analysis”.
4. Choose the appropriate analysis for your data:
   ANOVA: Single Factor
   Correlation
   Regression
   t-Test: Two-Sample Assuming Equal Variances
DATA SUMMARY STATISTICS
I. Mean
The mean is one measure of the central tendency of the data set, i.e., it gives an idea of
where the middle or center of the data set lies and therefore represents a good summary of the
data. To get the mean of a column of data entered in Excel:
1. Click on the cell where you want the mean to appear.
2. Type “=average(” and then highlight the data range for which you wish to calculate the mean. Close the parentheses and press Enter. The mean should appear in the cell.
II. Variation
Consider two data sets, A and B, that have the same mean yet are certainly different (say,
set A ranges from 1 to 21 while set B ranges only from 10 to 12). Which data set is more
variable? Is set A more dispersed than B? We need some measure of the variability within a data
set in order to better compare data sets. That is, we need a number that will reflect the
difference between these two data sets.
For example, we could use the range of the data (1-21 vs. 10-12). However, two data sets
with truly different variability may have the same or similar ranges, so the range is not a very
good measure of variability. We prefer measures of variability that are based on how much the
data deviate from their mean. The fundamental quantity is the “sum of squared deviations from
the mean,” or simply the “sum of squares,” typically denoted SS or SSX.
There are a number of different measures of variation that may be calculated based on the
sum of squares.
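In symbols, for a data set x_1, x_2, ..., x_n with mean \bar{x}, these measures are:

    SS = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s^2 = \frac{SS}{n-1}, \qquad s = \sqrt{s^2}, \qquad SE_{\bar{x}} = \frac{s}{\sqrt{n}}

where s^2 is the sample variance, s is the standard deviation, and SE is the standard error of the mean, each of which Excel computes as follows.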
Variance:
1. Click on the cell where you want the variance to appear.
2. Type “=var(” and then highlight the data range for which you wish to calculate the variance. Close the parentheses and press Enter. The variance of the data should appear in the cell.
Standard Deviation:
1. Click on the cell where you want the standard deviation to appear.
2. Type “=stdev(” and then highlight the data range for which you wish to calculate the standard deviation. Close the parentheses and press Enter. The standard deviation of the data should appear in the cell.
Both of these measures reflect the variability of the data around the mean of the data
set. Essentially, the variance is the average squared distance of the data points from the mean
(the sum of squares divided by n - 1). For roughly normally distributed data, about 68% of the
data points lie between the mean minus one standard deviation and the mean plus one standard
deviation.
Standard Error of the Mean:
1. Click on the cell where you want the standard error to appear.
2. Type “=stdev(your data range)/sqrt(n)”, where n = the number of observations in your data set. Press Enter. The program will compute the standard deviation of your data set and divide it by the square root of the number of observations. This is the standard error.
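If you are working outside Excel, the same summary statistics can be computed directly. The following is a minimal Python sketch assuming the NumPy library is available; the data values are arbitrary placeholders.

    import numpy as np

    data = np.array([10.1, 11.4, 11.7, 12.1, 13.3])  # arbitrary example values

    mean = data.mean()                      # central tendency
    variance = data.var(ddof=1)             # sample variance (n - 1 denominator, like Excel's VAR)
    std_dev = data.std(ddof=1)              # sample standard deviation (like Excel's STDEV)
    sem = std_dev / np.sqrt(data.size)      # standard error of the mean

    print(mean, variance, std_dev, sem)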
HYPOTHESIS TESTING
I. Null Hypothesis
Statistics are not only used to summarize, but also to test hypotheses. What is a
hypothesis? Basically it is a guess or assumption based upon some prior knowledge,
information, or claim. Consider the following example:
On a recent trip to the seashore, you notice that intertidal predatory snails are found more
often feeding on mussels than on barnacles.
Based upon this observation, you may hypothesize that snails take longer to kill and eat
mussels than they do barnacles. Since it takes them longer to eat mussels, the snails are seen
more frequently on the mussels simply because they spend more time there.
Next, you wish to test your hypothesis about feeding time requirements for snails feeding
on mussels and barnacles. That is, you want to know if snails really do take longer to eat a
mussel than to eat a barnacle. To test a hypothesis, we state it as a null hypothesis (H0), so
called because it is stated in terms of no effect or no difference.
General Form of a H0:
H0: There is no difference (or effect) between (or among) groups.
For this Example:
H0: There is no difference in the amount of time it takes snails to kill and eat mussels versus barnacles.
To test null hypotheses requires random sampling of the population to be studied and
experiments carefully designed to test the null hypothesis. Depending upon the results of the
experiment, you will either reject or fail to reject the null hypothesis. If the null hypothesis is
rejected, we must construct an alternative hypothesis (HA).
General Form of a HA:
HA: There is a difference (an effect) between (among) groups.
For this Example:
HA: It takes snails longer to eat one prey type (mussels or barnacles) than the other.
There are a variety of statistical analyses for testing hypotheses. Which one is used
depends upon the experimental setup (or design), the data and the hypothesis being tested (which
typically dictates the data and design).
II. Comparing Two Groups of Data (or Samples)
The appropriate statistical test to compare two data sets is the two-sample t-test, or
simply the t-test. Essentially, the t-test will take into account the variation in each set and
allow for a direct comparison of the means. Look at the following example:
Feeding Times in Hours
Barnacles    Mussels
10.1         12.1
11.4         13.3
11.7         14.5
12.1         14.5
13.3         15.3
To perform a t-test, enter your data into columns, then:
1. Highlight your first column of data, including the label (NOTE: there must not be a blank cell between your column label and your data).
2. Open “Tools”, then “Data Analysis”, and choose “t-Test: Two-Sample Assuming Equal Variances”.
3. Your range of data should appear in the “Variable 1 Range” window. Click on the “Variable 2 Range” window and enter the range for your second column of data.
4. Enter ‘0’ for the “Hypothesized Mean Difference”, check the “Labels” box, and make sure “Alpha” is 0.05.
5. Click on the “Output Range” circle and enter the cell in which you want the output to appear in the “Output Range” window. Click “OK”.
6. Compare the absolute value of “t Stat” to “t Critical two-tail”. If |t Stat| > t Critical, reject your null hypothesis; otherwise you fail to reject your null hypothesis.
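Outside Excel, the same test can be run with SciPy; the following is a minimal Python sketch using the feeding-time data from the table above (equal_var=True matches the equal-variances assumption).

    import numpy as np
    from scipy import stats

    # Feeding times in hours, from the table above
    barnacles = np.array([10.1, 11.4, 11.7, 12.1, 13.3])
    mussels = np.array([12.1, 13.3, 14.5, 14.5, 15.3])

    # Two-sample t-test assuming equal variances
    t_stat, p_value = stats.ttest_ind(barnacles, mussels, equal_var=True)

    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    # Reject H0 at alpha = 0.05 when p < 0.05 (equivalent to |t Stat| > t-critical)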
In the example, we must reject the null hypothesis that there is no difference in the amount of
time it takes snails to eat mussels versus barnacles. We then adopt the alternative hypothesis
that it takes snails longer to eat mussels. The relevance of the experiment is that we now have
one possible explanation for observing snails more often on mussels than on barnacles: because it
takes snails longer to eat mussels, they spend more time on them, and as a result you would
expect to find them there more often.
III. Comparing Three or More Groups of Data (or Samples)
The Analysis of Variance (ANOVA) is used when the experimental design has three or
more sets of data to compare. ANOVA determines if there is a significant difference among the
means of the groups. Consider carrying your interest in snail feeding further. For example, you
would now like to determine if the size of a snail has an effect on the size of mussel eaten
(i.e., do bigger snails eat bigger mussels?).
H0: There is no difference among sizes of snails in the size of mussels chosen as prey.
So, you choose five snails in each of three size classes, offer each a variety of mussel sizes,
and record the size of the first mussel eaten for each snail:
Size of First Mussel Eaten (mm)
Small        Medium        Large
(<20 mm)     (20-30 mm)    (>30 mm)
41           48            40
44           49            50
48           49            44
43           49            48
42           45            50
Next, you will follow a procedure similar to that for the t-test, except you will compute a
sample F statistic, Fs.
1. NOTE: Your data should be in columns, and there must not be a blank cell between your column label and your data. Also, there must not be blank columns between your data columns.
2. Open “Tools”, then “Data Analysis”, and choose “ANOVA: Single Factor”.
3. Enter your entire range of data in the “Input Range” window.
4. Check the “Columns” and “Labels” boxes, and make sure “Alpha” is 0.05.
5. Click on the “Output Range” circle and enter the cell in which you want the output to appear in the “Output Range” window. Click “OK”.
6. Compare the value of F to F-critical. If F > F-crit, reject your null hypothesis; otherwise you fail to reject your null hypothesis.
In this example, we fail to reject the null hypothesis, H0: There is no difference among sizes of
snails in the size of mussels chosen as prey.
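As a cross-check outside Excel, here is a minimal Python sketch using SciPy's f_oneway with the mussel-size data grouped as in the table above:

    import numpy as np
    from scipy import stats

    # Size of first mussel eaten (mm) for each snail size class, from the table above
    small = np.array([41, 44, 48, 43, 42])
    medium = np.array([48, 49, 49, 49, 45])
    large = np.array([40, 50, 44, 48, 50])

    # Single-factor ANOVA
    f_stat, p_value = stats.f_oneway(small, medium, large)

    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
    # Fail to reject H0 at alpha = 0.05 when p >= 0.05 (equivalent to F <= F-critical)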
IV. Analyzing Frequency (or Count) Data
Some experiments may require that the data be expressed as proportions or percentages,
i.e., they are essentially based on counts or frequencies. When data are of this type, goodness-
of-fit tests employing the G statistic are used. In these tests you are essentially comparing
the frequencies you observe in your experiment to the frequencies you would expect based on
other assumptions or if all things were equal.
As an example, assume you want to know if, during mating season, stone crabs spend
more time on mating, feeding, or aggressive behaviors. So you construct a null hypothesis,
H0: There will be no difference among the frequencies of mating, feeding, and aggressive
behaviors in stone crabs during mating season.
To test this hypothesis, you observe stone crabs during mating season. Each night you count how
many times each crab you observe is involved in each one of the three behaviors and produce the
following data set:
Behavior (a = 3)    Observed Frequencies (f)    Expected Frequencies (f*)
mating              63                          57.667
feeding             78                          57.667
aggression          32                          57.667
Total               173                         173.001
The observed frequencies are what you actually observed in the field. The expected frequencies,
f*, were determined from the expectation that if the crabs devoted similar amounts of time to
each behavior (i.e., they did not devote more time to any single behavioral category), you would
have expected to observe the crabs an equal number of times in each behavior category. Since
you have three categories, 1/3 of the 173 observed events (1/3 x 173 = 57.667) should have been
in each category. Next, compute the G statistic using the following equation:
G = 2 f i ln (
G = 2 [63 ln (
fi
)
f *i
63
78
32
) + 78 ln (
) + 32 ln (
)]
57.667
57.667
57.667
G = 20.5676
Then compare your observed G to χ2(0.05, 2) = 5.991 (the critical value at α = 0.05 with a - 1 = 2 degrees of freedom).
If G < χ2, then there is no difference among the categories with respect to the observed frequencies; fail to reject the null hypothesis.
If G > χ2, then there is a significant difference among the categories with respect to the observed frequencies; reject the null hypothesis.
In this example, we would reject the null hypothesis and construct an alternate hypothesis
that states that during mating season stone crabs do devote different amounts of time to mating,
feeding, and aggressive behaviors.
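Outside Excel, the same goodness-of-fit computation can be sketched in Python; SciPy's power_divergence with lambda_="log-likelihood" returns the G statistic:

    import numpy as np
    from scipy import stats

    # Observed behavior counts, from the table above
    observed = np.array([63, 78, 32])
    expected = np.full(3, observed.sum() / 3)   # equal expected frequencies (57.667 each)

    # G goodness-of-fit test (log-likelihood-ratio statistic)
    g_stat, p_value = stats.power_divergence(observed, expected, lambda_="log-likelihood")

    print(f"G = {g_stat:.4f}, p = {p_value:.4f}")   # G = 20.5676
    # Reject H0 at alpha = 0.05 when p < 0.05 (equivalent to G > chi-square critical)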
A similar analysis, the R x C G-Test of Independence, is employed if you want to
compare categories among different groups. R is the number of rows or categories your data set
has and C is the number of columns or groups in the data set. For example, after completing the
above study you now wish to see if stone crabs differ in the time devoted to the three behaviors
between mating and nonmating seasons. Your null hypothesis is
H0: crabs do not differ in the amount of time devoted to mating, feeding, or aggressive
behaviors between mating and nonmating seasons.
So, you do the same experiment only this time during the nonmating season as well.
Behavior (a = 3)    Mating Season (f)    Nonmating Season (f)    Totals
mating              63                   4                       67
feeding             78                   136                     214
aggression          32                   13                      45
Total               173                  153                     326
Step 1. Transform and sum the frequencies in the entire table:
63 ln 63 + 4 ln 4 + ... + 13 ln 13 = 1418.7549
Step 2. Transform and sum the row totals:
67 ln 67 + 214 ln 214 + 45 ln 45 = 1601.3331
Step 3. Transform and sum the column totals:
173 ln 173 + 153 ln 153 = 1661.1764
Step 4. Transform the grand total:
326 ln 326 = 1886.5285
Step 5. G = 2[Step 1 - Step 2 - Step 3 + Step 4]
G = 2[1418.7549 - 1601.3331 - 1661.1764 + 1886.5285]
G = 85.5478
Next, compare your observed G statistic to a χ2 value with (C - 1)(R - 1) degrees of freedom, in
this case χ2(0.05, 2) = 5.991.
If G < χ2, then the categories are independent of the groups, i.e., the groups are the same with respect to the observed frequencies in the three categories; fail to reject the null hypothesis.
If G > χ2, then the categories are not independent of the groups, i.e., there is a significant difference among the groups with respect to the observed frequencies in the three categories; reject the null hypothesis.
In this example, G is obviously much greater than χ2(0.05, 2), so we reject our null
hypothesis and construct an alternate hypothesis indicating that there is a difference between
mating and nonmating seasons in the time stone crabs devote among the three behavioral
categories.
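The same R x C test can be sketched in Python; SciPy's chi2_contingency with lambda_="log-likelihood" computes the G statistic of independence directly from the contingency table:

    import numpy as np
    from scipy import stats

    # Rows: mating, feeding, aggression; columns: mating season, nonmating season
    observed = np.array([[63, 4],
                         [78, 136],
                         [32, 13]])

    # R x C G-test of independence (log-likelihood-ratio statistic)
    g_stat, p_value, dof, expected = stats.chi2_contingency(observed, lambda_="log-likelihood")

    print(f"G = {g_stat:.4f}, df = {dof}, p = {p_value:.4g}")   # G = 85.5478, df = 2
    # Reject H0 at alpha = 0.05 when p < 0.05 (equivalent to G > chi-square critical)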
V. Examining Relationships among Data
Often you need to determine if two independent variables are significantly related; that
is, you want to know if there is a correlation between the two variables (e.g., feeding rate and
tidal height). They are both considered independent variables because they are independent of
each other: tidal height is obviously not determined by feeding rate, and conversely, feeding can
proceed regardless of tidal height (in the laboratory, for example). However, there may be some
relationship between the two, such that the animal in question feeds faster at higher water
levels than at lower ones (a positive correlation) or feeds slower at higher water levels (a
negative correlation). A correlation analysis is used to determine if such a relationship exists.
Alternatively, you may need to know the relationship between an independent and a
dependent variable in order to make predictions about the dependent variable given the
independent variable (e.g., body mass and body length). Here, body mass is dependent upon
body length, because if the organism grows longer it must also get heavier. However, body
length is considered independent of body mass; just because the organism got heavier does not
necessarily mean it got longer (i.e., maybe it got fatter). It is often easier and quicker to
measure the body length of an organism than its mass, even though mass is what is needed. If
you know the mathematical relationship between body length and body mass, you can accurately
predict the mass given the length. A regression analysis is used for this.
Fortunately, whether one is doing a correlation or regression analysis, the procedure is
the same. The only difference is the specific statistics one is interested in at the end of the
analysis.
Assume you are interested in predicting the dry tissue mass of a certain fish species. You
need to be able to predict this from body length for several reasons. First, the fish flop around
when they are placed on the balance. Not good. Second, since you will be presenting these fish
to a predator, they will need to be alive (fish are typically dead when you obtain their dry tissue
mass!). Third, you need to estimate the dry tissue mass of a live fish so that you can estimate
how much dry tissue the predator consumed. So, you begin by obtaining 20-30 fish in a wide
size range (a wide enough size range so as to cover any fish you will use in your upcoming
feeding experiment) to be sacrificed for the analysis. Next, you measure the length of each fish,
dry them in an oven, then weigh the dry remains. In the end, for each fish, you have its body
length and its respective dry tissue mass. You have generated the following data set:
(NOTE: the following data set has only 12 measures, for the sake of illustration; normally
many more measures would be needed. Also, these data are used for both a regression and a
correlation analysis, again for illustrative purposes only; one would not normally do both
analyses on the same data set.)
X                   Y
Body Length (mm)    Dry Tissue Mass (g)
159                 14.4
179                 15.2
100                 11.3
45                  2.5
384                 22.7
230                 14.9
100                 1.41
320                 15.81
80                  4.19
220                 15.39
320                 17.25
210                 9.52
Correlation Analysis
1. NOTE: There must not be a blank cell between your column label and your data.
2. Open “Tools”, then “Data Analysis”, and choose “Correlation”.
3. Enter the range of your X,Y data set, including labels.
4. Check the “Labels” box and make sure the confidence level is 95%.
5. Enter your output range and click “OK”.
6. “r” will appear in the correlation matrix.
Regression Analysis
1. NOTE: There must not be a blank cell between your column label and your data.
2. Open “Tools”, then “Data Analysis”, and choose “Regression”.
3. Enter the range of your X,Y data set, including labels.
4. Check the “Labels” box and make sure the confidence level is 95%.
5. Enter your output range and click “OK”.
6. Examine the R-square value; ideally it should be > 0.60.
7. For the linear equation, look under “Coefficients”: the “Intercept” coefficient is the Y-intercept, and the coefficient labeled with your X variable name is the slope of the line.
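Outside Excel, both analyses can be sketched in Python with SciPy's pearsonr and linregress, using the fish data from the table above (the 250 mm fish at the end is a hypothetical prediction, not part of the data set):

    import numpy as np
    from scipy import stats

    # Body length (mm) and dry tissue mass (g), from the table above
    length = np.array([159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210])
    mass = np.array([14.4, 15.2, 11.3, 2.5, 22.7, 14.9, 1.41, 15.81, 4.19, 15.39, 17.25, 9.52])

    # Correlation analysis: Pearson's r and its p-value
    r, p_r = stats.pearsonr(length, mass)
    print(f"r = {r:.3f}, p = {p_r:.4f}")

    # Regression analysis: least-squares line, mass = slope * length + intercept
    result = stats.linregress(length, mass)
    print(f"slope = {result.slope:.4f}, intercept = {result.intercept:.3f}, r^2 = {result.rvalue**2:.3f}")

    # Predict the dry tissue mass of a live fish from its length
    predicted = result.slope * 250 + result.intercept   # e.g., a hypothetical 250 mm fish
    print(f"predicted mass of a 250 mm fish: {predicted:.2f} g")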
For the correlation analysis, the correlation coefficient (r) is used. It ranges from -1.0 to
1.0 and reflects the association between the two variables; the closer to -1.0 or 1.0, the
stronger the relationship, and -1.0 or 1.0 is a perfect relationship. The sign (+ or -) tells
whether the relationship is negative (as one variable increases, the other decreases) or positive
(both variables increase together). To determine if r is significant, use column 1 at α = 0.05
and df = n - 2 in the table of Critical Values for Correlation Coefficients. In this case df = 10
and the table value is 0.576. Since our r is greater than the table value, we can say that the
correlation is significant.
For a regression analysis, the difference is only in interpretation. One still computes r
and looks up its significance in the Critical Values for Correlation Coefficients table. If it is
significant, you may be able to use your newly formed equation to predict Y from X. Next,
compute r2. This value tells you how much of Y’s variation is explained by X. It ranges from
0.0 (no good) to 1.0 (excellent), where 1.0 means all of Y’s variation is accounted for by X.
The more of Y’s variation X accounts for, the better X serves as a predictor of Y based on your
equation. So, in our example we have a significant regression (r > rtable), but r2 = 0.75. This
says that 75% of the variation in Y is accounted for by X. That leaves 25% of Y’s variation
unaccounted for, which suggests this equation may not be so great for predicting Y from X.