01-Ecology_Statistics_presentation_office_2007

PCB 3043L - General Ecology
Data Analysis
OUTLINE
• Organizing an ecological study
• Basic sampling terminology
• Statistical analysis of data
– Why use statistics?
– Describing data
• Measures of central tendency
• Measures of spread
• Normal distributions
• Using Excel
– Producing tables
– Producing graphs
– Analyzing data
– Statistical tests
• T-Tests
• ANOVA
• Regression
Organizing an ecological study
• What is the aim of the study?
• What is the main question being asked?
• What are your hypotheses?
• Collect data
• Summarize data in tables
• Present data graphically
• Statistically test your hypotheses
• Analyze the statistical results
• Present a conclusion to the proposed question
Basic sampling terminology
• Variables
• Populations
• Samples
• Parameters
• Statistics
What is a variable?
• Variable: any defined
characteristic that varies
from one biological entity to
another.
• Examples: plant height, bird
weight, human eye color, no.
of tree species
• If an individual is selected
randomly from a population,
it may display a particular
height, weight, etc.
• If several individuals are
selected, their characteristics
may be very similar or very
different.
What is a population?
• Population: the entire
collection of measurements
of a variable of interest.
• Example: if we are
interested in the heights of
pine trees in Everglades
National Park (Plant height
is our variable) then our
population would consist of
all the pine trees in
Everglades National Park.
What is a sample?
• Sample: smaller groups or
subsets of the population
which are measured and
used to estimate the
distribution of the variable
within the true population
• Example: the heights of 100
pine trees in Everglades
National Park may be used
to estimate the heights of
trees within the entire
population (which actually
consists of thousands of
trees)
What is a parameter?
• Parameter: any
calculated measure used
to describe or
characterize a population
• Example: the average
height of pine trees in
Everglades National Park
What is a statistic?
• Statistic: an estimate of
any population parameter
• Example: the average
height of a sample of 100
pine trees in Everglades
National Park
Why use statistics?
• It is not always possible to obtain measures and calculate
parameters of variables for the entire population of interest
• Statistics allow us to estimate these values for the entire
population based on multiple, random samples of the variable
of interest
• The larger the number of sampled individuals, the closer the estimated
measure tends to be to the true population measure
• Statistics also allow us to efficiently compare populations to
determine differences among them
• Statistics allow us to determine relationships between variables
Statistical analysis of data
Heights of pine trees at 2 sites
in Everglades National Park
Site 1
Site 2
5
7
3
8
6
4
2
8
3
7
• Measures of central tendency
• Measures of dispersion and variability
Measures of central tendency
• Where is the center of the distribution?
mean ($\bar{x}$ or $\mu$): arithmetic mean, $\bar{x} = \frac{\sum x}{n}$
median: the value in the middle of the ordered data set
mode: the most commonly occurring value
Example data set : 1, 2, 2, 2, 3, 5, 6, 7, 8, 9, 10
Mean = (1 + 2 + 2 + 2+ 3 + 5 + 6 + 7 + 8 + 9 + 10)/11 = 55/11 = 5
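Median: for 1, 2, 2, 2, 3, 5, 6, 7, 8, 9, 10 (odd count) the middle value is 5
for 1, 2, 2, 2, 3, 5, 6, 7, 8, 9, 10, 11 (even count) it is the mean of the two middle values, (5 + 6)/2 = 5.5
Mode: in 1, 2, 2, 2, 3, 5, 6, 7, 8, 9, 10 the most common value is 2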
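The three measures above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module rather than Excel; the variable names are illustrative and not part of the slides.

```python
import statistics

# Example data set from the slide
data = [1, 2, 2, 2, 3, 5, 6, 7, 8, 9, 10]

mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most commonly occurring value

# With an even number of values, the median is the mean of the two middle values.
median_even = statistics.median(data + [11])

print(f"mean={mean}, median={median}, mode={mode}, median (12 values)={median_even}")
```

The Excel functions AVERAGE, MEDIAN and MODE play the same role.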
Measures of dispersion and variability
• How widely is the data distributed?
range: largest value minus smallest value
variance ($s^2$ or $\sigma^2$): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
standard deviation ($s$ or $\sigma$): $s = \sqrt{s^2}$
[Figure: two distributions, one with a large spread and one with a small spread]
Measures of dispersion and variability
Example data set: 0, 1, 3, 3, 5, 5, 5, 7, 7, 9, 10
Variance = 9.8
Standard Deviation = 3.13
Range = 10
[Figure: histogram of Number of Occurrences vs. Value]
Example data set: 0, 10, 30, 30, 50, 50, 50, 70, 70, 90, 100
Variance = 980
Standard Deviation = 31.30
Range = 100
[Figure: histogram of Number of Occurrences vs. Value]
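The figures quoted for the two example data sets can be reproduced as follows. This is a sketch with Python's standard `statistics` module, which uses the sample (n − 1) formulas; the variable names are illustrative.

```python
import statistics

# First example data set from the slide
data = [0, 1, 3, 3, 5, 5, 5, 7, 7, 9, 10]

value_range = max(data) - min(data)   # largest value minus smallest value
variance = statistics.variance(data)  # sample variance, n - 1 in the denominator
std_dev = statistics.stdev(data)      # square root of the sample variance

# Second example data set: every value multiplied by 10.
# The standard deviation scales by 10 and the variance by 10 squared.
scaled = [10 * x for x in data]
scaled_variance = statistics.variance(scaled)
scaled_std_dev = statistics.stdev(scaled)

print(f"range={value_range}, variance={variance:.1f}, sd={std_dev:.2f}")
print(f"scaled variance={scaled_variance:.0f}, scaled sd={scaled_std_dev:.2f}")
```

In Excel the corresponding functions would be VAR and STDEV.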
Normal distribution of data
• A data set in which most values are around the
mean, with fewer observations towards the
extremes of the range of values
• The distribution is symmetrical about the mean
Proportions of a Normal Distribution
• A normal population of 1000 body weights
• μ = 70 kg, σ = 10 kg
• 500 weights are > 70 kg
• 500 weights are < 70 kg
[Figure: Weights of Black Bears in Bunting Park; No. of bears vs. Weights (kg)]
Proportions of a Normal Distribution
[Figure: Weights of Black Bears in Bunting Park; No. of bears vs. Weights (kg)]
• How many bears have a weight > 80 kg?
• μ = 70 kg, σ = 10 kg, X = 80 kg
• We use an equation to tell us how many standard deviations
from the mean the X value is located:
$Z = \frac{X - \mu}{\sigma} = \frac{80 - 70}{10} = 1$
• We then use a special table to tell us what proportion of a
normal distribution lies beyond this Z value
• This proportion is equal to the probability of drawing at
random a measurement (X) greater than 80kg
Z table
• Look for Z value on table (1.0)
• Find associated P value (0.1587)
• P value states there is a 15.87% (0.1587 × 100) chance that a
bear selected at random from the population of 1000 bears will
have a weight greater than 80 kg, i.e. about 159 of the 1000 bears
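The Z-table lookup can be cross-checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of a printed table; the variable names are illustrative.

```python
from statistics import NormalDist

mu = 70.0     # population mean (kg)
sigma = 10.0  # population standard deviation (kg)
x = 80.0      # weight of interest (kg)

# Number of standard deviations between x and the mean
z = (x - mu) / sigma

# Proportion of a standard normal distribution lying beyond z (upper tail)
p_greater = 1.0 - NormalDist().cdf(z)

# Expected number of bears heavier than x in a population of 1000
expected_bears = p_greater * 1000

print(f"Z = {z:.1f}, P(X > {x:.0f} kg) = {p_greater:.4f}, about {expected_bears:.0f} bears")
```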
Probability distribution tables
• There are multiple probability tables for
different types of statistical tests.
e.g. Z-Table, t-Table, χ²-Table
• Each allows you to associate a “critical value”
with a “P value”
• This P value is used to determine the
significance of statistical results
Using Excel
• Program used to organize data
• Produce tables
• Perform calculations
• Make graphs
• Perform statistical tests
Organizing data in tables
• Allows you to arrange data in a format that
is best for analysis
• The following are the steps you would use:
Performing calculations
• Allows you to perform several calculations
• Sum, Average, Variance, Standard
deviation
• Basic subtraction, addition, multiplication
• More complex formulas
Making graphs
• Bar Charts
• Scatter Plots
Analyzing Data in Excel
Statistical tests can be done to determine:
• Whether or not there is a significant difference
between two data sets (Student’s t-test)
• Whether or not there is a significant difference
between more than two data sets (ANOVA)
• Whether or not there is a significant relationship
between two variables (Regression analysis)
Analyzing Data in Excel
The following steps must be followed:
1. Choose an appropriate statistical test
2. State H0 and HA
3. Run test to produce Test Statistic
4. Examine P-value
5. Decide whether to reject or fail to reject H0
Analyzing Data in Excel
• Normally, you would have to calculate the test statistic by hand
and look up the P value on a table
• All tests done in Excel provide the P value for you
• This P value is used to determine the significance of
statistical results
• This P value must be compared to an α value
• α value is usually 0.05 or less (e.g. 0.01)
• P < 0.05 means that, if the null hypothesis were true, a result at
least this extreme would occur less than 5% of the time
• The lower the α value, the stronger the evidence required before
rejecting the null hypothesis
• First thing you must do is select which statistical test
you want to perform
• This is how it is done:
t-Tests
• Used to compare the means of two populations and answer the
question:
Is there a significant difference between the two populations?
• Example: Is there a significant difference between the average
height of pine trees from 2 sites in Everglades National Park?
• You cannot use this test to compare two different types of data
(e.g. water depth data and soil depth data).
• It can only compare two sets of data based on the same data
type (e.g. water depth data from two different sites)
• The two data sets that are being compared must be presented
in the same units. (e.g. you can compare two sets of data if
both are recorded in days. You cannot compare data recorded
in units of days with data recorded in units of months)
• Your Null Hypothesis is always:
There is no significant difference between the two
compared populations (μ1= μ2)
• Your Alternative Hypothesis is always:
There is a difference between the two compared
populations (μ1 ≠ μ2)
t-Tests
• When you run the test, look for the p-value
• If p > 0.05 then fail to reject your Null Hypothesis and state that
“there is no significant difference between the two compared
populations”
• If p < 0.05 then reject your Null Hypothesis and state that “there
is a significant difference between the two compared
populations”
t-Tests
• When you run the test, look for the p-value
• Our results show P = 0.09903
• Therefore P > 0.05 (if the null hypothesis were true, a difference at
least this large would occur more than 5% of the time by chance)
• So we must fail to reject the Null Hypothesis and state that “there
is no significant difference between the two compared
populations”
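The five steps can also be sketched outside Excel. The example below is a minimal sketch, not the slide's own calculation: the heights are hypothetical (they are not the data behind P = 0.09903), the test assumes equal variances at the two sites, and instead of a P value it compares the test statistic with the two-tailed critical value from a standard t-table (2.306 for α = 0.05 and 8 degrees of freedom).

```python
import math
import statistics

# Hypothetical pine tree heights (m) at two sites; NOT the data from the slides.
site_1 = [4.0, 5.0, 6.0, 5.0, 5.0]
site_2 = [6.0, 7.0, 8.0, 7.0, 7.0]

n1, n2 = len(site_1), len(site_2)
mean_1, mean_2 = statistics.mean(site_1), statistics.mean(site_2)
var_1, var_2 = statistics.variance(site_1), statistics.variance(site_2)

# Pooled variance: assumes both populations share the same variance.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / df

# Test statistic for H0: mu1 = mu2
t_stat = (mean_1 - mean_2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-tailed critical value for alpha = 0.05 and df = 8, read from a t-table.
T_CRITICAL = 2.306
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Comparing |t| with the critical value gives the same decision as comparing the P value with α, which is what Excel reports directly.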
ANOVA
[Figure: Number of Daily Beers vs. Number of Students; series: Micro, Eco, Business, Statistics]
• Used to compare the means of more than two populations and
answer the question:
Is there a significant difference between the populations?
• Example: Is there a significant difference between the average
height of pine trees from 4 sites in Everglades National Park?
• For comparing a particular feature of two or more populations,
use a Single Factor ANOVA
• For comparing a particular feature of two or more populations,
subdivided into two groups, use a Two Factor ANOVA
• Your Null Hypothesis is always:
There is no significant difference between the
compared populations (μ1 = μ2 = μ3 = μ4 …)
• Your Alternative Hypothesis is always:
There is a difference between at least two of the
compared populations (not all of μ1, μ2, μ3, μ4 … are equal)
ANOVA
• When you run the test, look for the p-value
• If p > 0.05 then fail to reject your Null Hypothesis and state that
“there is no significant difference between the compared
populations”
• If p < 0.05 then reject your Null Hypothesis and state that “there
is a significant difference between at least two of the compared
populations”
ANOVA
• When you run the test, look for the p-value
• Our results show P = 0.002197
• Therefore P < 0.05 (if the null hypothesis were true, differences at
least this large would occur less than 5% of the time by chance)
• So we must reject the Null Hypothesis and state that “there is a
significant difference between at least two of the compared
populations”
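For readers who want to see what a Single Factor ANOVA computes, here is a minimal sketch. The heights are hypothetical (not the data behind P = 0.002197), and the decision uses the critical value from a standard F-table (about 3.89 for α = 0.05 with 2 and 12 degrees of freedom) rather than a P value.

```python
import statistics

# Hypothetical pine tree heights (m) at three sites; NOT the data from the slides.
groups = {
    "site_1": [4.0, 5.0, 6.0, 5.0, 5.0],
    "site_2": [6.0, 7.0, 8.0, 7.0, 7.0],
    "site_3": [5.0, 6.0, 7.0, 6.0, 6.0],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = statistics.mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Variation between group means and variation within groups
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups.values() for v in g)

ms_between = ss_between / (k - 1)       # df between = k - 1
ms_within = ss_within / (n_total - k)   # df within = N - k
f_stat = ms_between / ms_within

# Critical value for alpha = 0.05 and df = (2, 12), read from an F-table.
F_CRITICAL = 3.89
reject_h0 = f_stat > F_CRITICAL

print(f"F = {f_stat:.2f}, reject H0: {reject_h0}")
```

A large F means the group means differ by more than the scatter within groups would explain, but it does not say which groups differ.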
ANOVA
• Remember:
The ANOVA result will only tell you that
i) None of the data sets are significantly
different from each other
OR
ii) At least two of the data sets among the data
sets being compared are significantly
different
• If there is a significant difference between at
least two data sets, it will not tell you which
two.
Regression analysis
[Figure: Money Spent by TA ($) vs. Price of Whiskey ($)]
• Used to determine whether or not there is a linear relationship
between two variables and answer the question:
Is there a significant linear relationship between two variables?
• Example: Is there a significant relationship between the
average height of pine trees and soil depth in Everglades
National Park?
• It basically creates an equation (or line) that best predicts Y
values based on X values.
• You cannot use this test to compare populations. It only
compares variables.
• You are looking at two different variables (e.g. water depth (cm)
and plant abundance (no. of individuals)), so the data sets do
not have to be presented in the same units
• Your Null Hypothesis is always:
There is no significant linear relationship between the
two variables
• Your Alternative Hypothesis is always:
There is a significant linear relationship between the
two variables
• R squared: how well “y” can be predicted by “x”, i.e. how
strong the linear relationship is between the two variables.
• The closer R squared is to 0, the worse the regression line fits the data.
• The closer R squared is to 1, the better the regression line fits the data.
Example: R square value of 0.94
• The regression line fits the data well
• The points all lie fairly close to the
line, so there is a defined linear
relationship between the two
variables
• “x” can be used to predict “y”
[Figure: Money Spent by TA ($) vs. Price of Whiskey ($)]
Example: R square value of 0.04
• The regression line does not fit the
data well
• Many of the points lie far from the
line, so there is not a defined linear
relationship between the two
variables
• “x” cannot be used to predict “y”
[Figure: Money Spent by TA ($) vs. Price of Whiskey ($)]
Regression analysis
• When you run the test, look for the Significance F or Sample
p-value
• If p > 0.05 then fail to reject your Null Hypothesis and state
that “There is no significant linear relationship between the
two variables”
• If p < 0.05 then reject your Null Hypothesis and state that
“There is a significant linear relationship between the two
variables”
Regression analysis
• When you run the test, look for the p-value
• Our results show Significance F or Sample p-value = 1.65E-08 =
0.0000000165
• Therefore P < 0.05 (if the null hypothesis were true, a relationship at
least this strong would occur less than 5% of the time by chance)
• So we must reject the Null Hypothesis and state that “There is
a significant linear relationship between the two variables”
• Next look at the R squared value
• Our results show R squared = 0.975
• Therefore the line fits the data well
• “x” can be used to predict “y”
Ecological study
• What is the aim of the study?
• What is the main question being asked?
• What are your hypotheses?
• Collect data
• Summarize data in tables
• Present data graphically
• Statistically test your hypotheses
• Analyze the statistical results
• Present a conclusion to the proposed question
Aim: To determine whether or not there are changes in heights of Pine trees with
distance from the edge of a forest trail in Everglades National Park.
Hypotheses:
H0: There is no significant relationship between distance from the edge of the trail and
Pine tree height
HA: There is a significant relationship between distance from the edge of the trail and
Pine tree height
Results:
Average tree height of pine trees along transect
from forest trail to interior forest at ENP
Distance from trail (m)    Plant heights (m)
0                          2.1
5                          2.7
10                         2.9
15                         3.1
20                         3.4
25                         3.7
30                         3.8
35                         4.5
40                         4.6
45                         4.8
50                         5.6
SUM                        41.2
AVERAGE                    3.74
STANDARD DEVIATION         1.04
[Figure: Change in tree height with distance from forest trail; Tree height (m) vs. Distance from trail (m)]
• P = 1.65E-08
• Since P < 0.05, reject H0
• Therefore, there is a significant relationship
between distance from the edge of the trail
and Pine tree height
• R Square = 0.97, so there is a strong
positive linear relationship between
distance from the trail and plant height
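The R squared value reported above can be reproduced from the distance and height values in the results table. The sketch below fits the least-squares line by hand with plain Python instead of Excel's regression tool; the variable names are illustrative, and the P value is not computed because that requires the F distribution.

```python
# Distance from trail (m) and average tree height (m), from the results table
distance = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
height = [2.1, 2.7, 2.9, 3.1, 3.4, 3.7, 3.8, 4.5, 4.6, 4.8, 5.6]

n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(height) / n

# Sums of squares and cross-products about the means
s_xx = sum((x - mean_x) ** 2 for x in distance)
s_yy = sum((y - mean_y) ** 2 for y in height)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, height))

# Least-squares line: height = intercept + slope * distance
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Proportion of the variation in height explained by the line
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(f"height = {intercept:.2f} + {slope:.4f} * distance, R squared = {r_squared:.3f}")
```

The positive slope corresponds to trees becoming taller with distance from the trail.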
Discussion/Conclusion:
The gap created by the trail may be adversely affecting Pine trees, such that they are
shorter near the trail and become taller with distance from the trail.
Assignment – Worksheet 1
Three questions:
1. T-test
2. Single factor ANOVA
3. Regression analysis