S6. Plant Ecology Research-Lab 4-basic stats lab

Fleming, 2014, plant ecology research lesson, S6 Lab 4
Lab 4: Basic Statistics
There are many metrics by which we can quantify aspects of plant communities. In today’s lab you will
use the files “supplemental table SI-2” and “supplemental table SI-1” to quantify and compare selected
characteristics from sampled plant species.
The big questions to answer today are (1) what major trends or patterns can be discerned in these data
sets; (2) what differences among major groups of plants can be demonstrated with values within these
data sets?
Today, you will use the concepts of central tendency and variance to analyze data. Additionally, you
will analyze pairs of data sets against each other using a method called a t-test, or three or more data sets
against each other using a method called analysis of variance (ANOVA).
MS Excel is handy for these simple statistical tests if you have the “Analysis ToolPak” add-in enabled (it provides the “Data Analysis” tools), and of course it is also very useful for calculating averages of data in rows and/or columns.
However, we can all use a free online statistical computation website called “vassarstats” (vassarstats.net) to do these
analyses as well. It’s robust, easy to use, costs nothing, provides links to logical and computational
details for each statistical method, and gives every user the exact same output format.
In groups of 2-3, work through the following exercise. If you each have a personal laptop, please bring it
with you and use it in lab so that you are all working in parallel. If you don’t have a laptop, please work
together closely with someone in your group who does. Or, if you prefer, the group can go to the
computer lab to complete its work, with some people working on their laptops and some on the desktops
in the computer lab.
Answers to the following questions are due next week in lab, one set of answers per group.
1. As a simple warm up, consider the following data scenario. Consider 10 college students riding a bus
to the CSU Stanislaus campus. Assume each of them by some astonishing coincidence makes exactly
$10,000 per year. What would the average income of this group of students be? Now assume that one
student gets off and Bill Gates gets on the bus! His income is about $1,000,000,000 per year
(wow!)…now, what is the average income of all the people on the bus?
“Average” can be expressed three different ways. The most common expression of an “average” is called
the mean, which is simply the sum of the sample values divided by the sample size. For example, if you
have the values 3,6,7 then the mean is (3+6+7)/3 = 5.33. Another expression of the “average” value is
called the median. The median value of a sample is simply the middle value of all the samples. For
example, if you have the values 3, 6, 7 then the median is 6. If you have the values 3,6,7,9 then the
median is 6.5 (take the mean of the middle two values). One tip: finding the median is much easier if you
order the samples from smallest to greatest first, then look for the middle value. The final expression of an
“average” is called the mode. The mode is simply the value that occurs most often in your sample. For
example, if you have the values 3,6,7 then there is no mode! If you have 3,3,6,7 then the mode = 3.
Central Tendency Method    Raw Data      Central Tendency Value
Mean                       3, 6, 7       5.33
Median                     3, 6, 7       6
Median                     3, 6, 7, 9    6.5
Mode                       3, 6, 7       none
Mode                       3, 3, 6, 7    3
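The three “averages” described above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper names `mode_or_none` and `central_tendency` are made up for illustration, and the rule “report no mode when no value repeats” is an assumption taken from the 3, 6, 7 example.

```python
from collections import Counter
from statistics import mean, median


def mode_or_none(values):
    """Return the most frequent value, or None when no value repeats.

    Assumption: if several values tie for most frequent, the first one
    encountered in the data is returned.
    """
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else None


def central_tendency(values):
    """Summarize a sample with its mean, median, and mode."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode_or_none(values),
    }


# The small samples used in the examples above.
for sample in ([3, 6, 7], [3, 6, 7, 9], [3, 3, 6, 7]):
    print(sample, central_tendency(sample))
```

Note that `median` sorts the values internally, so the samples do not need to be ordered first.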
Of course, we also need to report the amount of variation in the data. As with averages, there are
several ways to do this, but we will consider two methods here: the standard deviation and the standard
error. Standard deviation describes the typical amount by which a point in your data set deviates from
the sample mean. If there is much “spread” in the data, then standard deviation will be large compared to
a precise data set where there is little “spread”. Standard error builds on standard deviation by taking
sample size into account: it describes how precisely the sample mean has been estimated. For the same
amount of spread, a large sample will have a smaller standard error than a small sample.
Let’s return to our example of the bus passengers and the values we might compute for their average
salaries. Calculate the mean, median, mode, standard deviation and standard error of these two samples.
See below for hints!
                      Income               Income
                      (only students)      (students + Bill Gates)
                      10,000               10,000
                      10,000               10,000
                      10,000               10,000
                      10,000               10,000
                      10,000               10,000
                      10,000               10,000
                      10,000               1,000,000,000
                      10,000               10,000
                      10,000               10,000
                      10,000               10,000
Mean
Median
Mode
Standard Deviation
Standard Error
NOTE:
Standard deviation is calculated as the square root of variance. Variance is calculated by summing the
squares of the deviations of each individual observation from the sample mean and dividing by one less
than the sample size. Before squaring, individual deviations can be negative (and together they sum to zero)! We square
each deviation to get positive values, but that leaves us with “units squared”, which leads us to use the
square root of variance to obtain variation in terms of our original units. A helpful tutorial can be found
here: http://www.youtube.com/watch?v=qqOyy_NjflU
Variance = \( s^2 = \sum (x - \bar{x})^2 / (n - 1) \), so standard deviation = \( s = \sqrt{s^2} \)
Standard error is calculated as the standard deviation divided by the square root of the sample size.
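The calculations in this note can be written out step by step. The sketch below follows the definitions above (divide by one less than the sample size, take the square root, then divide by the square root of the sample size); the function names are illustrative, and the sample 3, 6, 7 is reused from the earlier example rather than taken from the bus data.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    sample_mean = sum(values) / n
    squared_deviations = [(x - sample_mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def standard_deviation(values):
    """Square root of the sample variance (back in the original units)."""
    return sqrt(sample_variance(values))


def standard_error(values):
    """Standard deviation divided by the square root of the sample size."""
    return standard_deviation(values) / sqrt(len(values))


sample = [3, 6, 7]
print(f"variance = {sample_variance(sample):.2f}")                # 4.33
print(f"standard deviation = {standard_deviation(sample):.2f}")   # 2.08
print(f"standard error = {standard_error(sample):.2f}")           # 1.20
```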
Another important concept in statistical analyses is “significance”. You have probably heard or used the
expression “there is no significant difference between these data sets.” What does “significant” mean in a
statistical context? That the results are meaningful or important? Not exactly. In statistics, a significant
result means that an effect as large as the one observed would be unlikely to arise by chance alone
(by convention, a probability of less than 0.05). Said another way, a significant result in statistical testing
gives us reason to conclude that the observed effect is attributable to the variable(s) we manipulated in an
experiment, or that two or more groups truly do differ from each other on average.
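One way to make “chance alone” concrete is a permutation test. It is not one of the methods used in this lab, and it is shown here only as an illustration. The sketch below assumes two independent groups and asks: if the group labels were shuffled in every possible way, how often would the difference in means be at least as large as the one actually observed? The function name and the cover values are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Every way of relabeling the pooled values into groups of the original
    sizes is enumerated, so this is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_difference(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = mean_difference(sum(group_a))
    as_extreme = 0
    relabelings = 0
    for indices in combinations(range(len(pooled)), n_a):
        relabelings += 1
        sum_a = sum(pooled[i] for i in indices)
        # Small tolerance guards against floating-point round-off.
        if mean_difference(sum_a) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / relabelings


# Hypothetical percent-cover values for two groups of five plots each.
tree_cover = [12, 15, 9, 20, 14]
shrub_cover = [30, 25, 28, 35, 22]
print(permutation_p_value(tree_cover, shrub_cover))
```

A small result means that few relabelings produce a difference as large as the observed one, which is the sense in which chance alone is an unlikely explanation.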
2. Download the file “sample_speciesv2”. This data set is a simplified data set much like one you will
generate and use for your Red Hills project. For this exercise you can disregard the first three rows in the
file. The values in each cell represent the percent (%) cover for a particular species in a particular stand
(location). Use vassarstats.net to explore the data for patterns (your instructor will walk you through a
simple exercise to get you started with this website), and answer the following questions (a tutorial on
using vassarstats for a different type of statistical analysis can be found here:
https://www.youtube.com/watch?v=qrdMDnwFapE). Be sure to graph your findings in terms of averages
and standard deviation!
a. What is the mean (average) cover for each species? For each stand? What is the standard
deviation and standard error for each species and each stand?
b. What is the mean cover for each functional group? What is the mean functional group cover for
each stand? Provide standard deviations and standard errors as well.
c. Is tree cover significantly different from shrub cover for the 20 plots? A t-test will help you answer
this question. Be sure to use the “independent sample” option for t-tests. Should you use the
“equal variance” or “unequal variance” output? Why?
d. Is Grass 3 cover significantly different from Moss 1 cover? A t-test will help you answer this
question. Be sure to use the “independent sample” option for t-tests. Should you use the “equal
variance” or “unequal variance” option? Why?
e. Are there significant differences among mean cover of all the functional groups? ANOVA is
appropriate here (be sure to use the “single factor  independent sample” option). Which
functional group(s) differ from the others? How do you know?
f. Are Shrub 1, Grass 3, and Moss 1 cover significantly different from each other? An ANOVA test
(single factor  independent sample) will help you answer this question.
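Questions c and d ask you to choose between the “equal variance” and “unequal variance” versions of the independent-samples t-test. The sketch below shows how the two test statistics are commonly computed (the pooled-variance form and Welch's form), so you can see where the two groups' variances enter. It is not a description of how vassarstats works internally, the function names and data are hypothetical, and it does not compute the p-value, which still has to come from the t distribution.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Equal-variance t statistic and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


def welch_t(a, b):
    """Unequal-variance (Welch) t statistic and approximate degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each group mean.
    se2_a, se2_b = variance(a) / n_a, variance(b) / n_b
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical percent-cover values; the second group is far more variable.
group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 4, 6, 8, 10]
print("equal variance:  ", pooled_t(group_1, group_2))
print("unequal variance:", welch_t(group_1, group_2))
```

With equal group sizes the two t values coincide and only the degrees of freedom differ; with unequal group sizes and unequal spread, the two versions can give noticeably different answers.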
3. Use the greenhouse data (entered into S3 Table 1 – greenhouse activity data template) to answer the
following questions. Again, use vassarstats.net to explore the data and answer the following
questions. Be sure to graph your findings.
a. Is the average per stem biomass for radishes grown on flats of 8 species in 2013 significantly
different from that in 2014?
b. Are there significant differences in per stem biomass in corn grown in every species combination?
For this question, lump together the 2013 and 2014 data so you have only 4 columns of data (1, 2,
4 and 8 species combinations).
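For a comparison of three or more independent groups, such as the four columns in question 3b, a single-factor ANOVA compares the variation between group means with the variation within groups. The following is a minimal sketch of that F ratio, assuming independent samples; the function name and the biomass values are hypothetical, and the p-value still has to be obtained from the F distribution.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of independent groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2
        for group, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2
        for group, m in zip(groups, group_means)
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical per stem biomass values for 1, 2, 4 and 8 species combinations.
biomass = [
    [2.1, 2.4, 1.9],
    [2.0, 2.6, 2.3],
    [1.5, 1.8, 1.6],
    [1.2, 1.4, 1.1],
]
print(one_way_anova_f(biomass))
```

A large F ratio indicates that the group means differ by more than the within-group scatter would suggest, but it does not say which groups differ from each other.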