Checking assumptions - exploratory data analysis (EDA)

Research Methods
1998
Graphical design and analysis
Gerry Quinn, Monash University, 1998
Do not modify or distribute without expressed written permission of author.
Graphical displays
• Exploration
– assumptions (normality, equal variances)
– unusual values
– which analysis?
• Analysis
– model fitting
• Presentation/communication of results
Space shuttle data
• NASA meeting Jan 27th 1986
– day before launch of shuttle Challenger
• Concern about low air temperatures at
launch
• Affect O-rings that seal joints of rocket
motors
• Previous data studied
O-ring failure vs temperature, pre-1986
[Figure: O-ring failure (0 to 3) plotted against joint temp. (°F, 50 to 85)]
Challenger flight
Jan 28th 1986 - forecast temp 31 °F
O-ring failure vs temperature
[Figure: O-ring failure (0 to 3) plotted against joint temp. (°F, 50 to 85)]
Checking assumptions - exploratory data analysis (EDA)
• Shape of sample (and therefore population)
– is distribution normal (symmetrical) or skewed?
• Spread of sample
– are variances similar in different groups?
• Are outliers present?
– observations very different from the rest of the
sample
Distributions of biological data
Bell-shaped symmetrical distribution:
• normal
Skewed asymmetrical distribution:
• log-normal
• Poisson
[Figures: Pr(y) plotted against y for each distribution shape]
Common skewed distributions
Log-normal distribution:
• $\mu$ proportional to $\sigma$
• measurement data, e.g. length, weight etc.
Poisson distribution:
• $\mu = \sigma^2$
• count data, e.g. numbers of individuals
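The two mean-variance relationships above can be checked against the standard closed-form moments of each distribution. The following is a minimal sketch, assuming the usual parameterisation of the log-normal by the mean and standard deviation of log(y); the function names and parameter values are illustrative and do not come from the slides.

```python
import math

def lognormal_mean_sd(mu_log, sigma_log):
    """Return (mean, sd) of y when log(y) ~ Normal(mu_log, sigma_log)."""
    mean = math.exp(mu_log + sigma_log ** 2 / 2)
    sd = mean * math.sqrt(math.exp(sigma_log ** 2) - 1)
    return mean, sd

def poisson_mean_var(lam):
    """Return (mean, variance) of a Poisson variable with rate lam."""
    return lam, lam

# Log-normal: sd / mean stays constant as the mean grows
# (hypothetical parameter values, chosen only for illustration).
for mu_log in (0.0, 1.0, 2.0):
    mean, sd = lognormal_mean_sd(mu_log, sigma_log=0.5)
    print(f"log-normal  mean={mean:8.3f}  sd={sd:7.3f}  sd/mean={sd / mean:.3f}")

# Poisson: the variance equals the mean.
for lam in (1.0, 5.0, 20.0):
    mean, var = poisson_mean_var(lam)
    print(f"Poisson     mean={mean:6.1f}  variance={var:6.1f}")
```

In both cases the spread grows with the mean, which is why samples from these distributions tend to show unequal variances between groups.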
Exploring sample data
Example data set
• Quinn & Keough (in press)
• Surveys of 8 rocky shores along Point
Nepean coast
• 10 sampling times (1988 - 1993)
• 15 quadrats (0.25 m²) at each site
• Numbers of all gastropod species and %
cover of macroalgae recorded from each
quadrat
Frequency distributions
• Observations grouped into classes
[Figure: number of observations plotted against value of variable (class), for a NORMAL and a LOG-NORMAL distribution]
Number of Cellana per quadrat
Survey 5, all shores combined
Total no. quadrats = 120
[Figure: histogram of frequency (0 to 30) against number of Cellana per quadrat (0 to 100)]
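Grouping observations into classes, as in the frequency distribution above, can be sketched in a few lines. This example assumes non-negative integer counts and a class width of 20; the data are hypothetical and are not the survey counts from the slides.

```python
from collections import Counter

def frequency_table(values, class_width):
    """Count observations in classes [0, w), [w, 2w), ... keyed by lower bound."""
    counts = Counter((v // class_width) * class_width for v in values)
    top = max(counts)
    # Keep empty classes so gaps in the distribution stay visible.
    return {lower: counts.get(lower, 0)
            for lower in range(0, top + class_width, class_width)}

# Hypothetical counts per quadrat (NOT the survey data from the slides).
counts_per_quadrat = [0, 2, 3, 5, 8, 11, 14, 19, 23, 27, 35, 48, 95]
width = 20

table = frequency_table(counts_per_quadrat, class_width=width)
for lower, n in table.items():
    print(f"{lower:3d}-{lower + width - 1:3d} | {'*' * n} ({n})")
```

A long right-hand tail in the printed table, with most observations in the lowest classes, is the pattern expected for skewed count data.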
Dotplots
• Each observation represented by a dot
• Number of Cellana per quadrat, Cheviot
Beach survey 5
• No. quadrats = 15
[Figure: dotplot of number of Cellana per quadrat (0 to 40)]
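Boxplot
[Figure: anatomy of a boxplot for a VARIABLE, from top to bottom: outlier (*), largest value, hinge, median, hinge, smallest value. The spread is the distance between the hinges; each of the four segments contains 25% of values.]
[Figure: boxplots for four GROUPS: 1. IDEAL, 2. SKEWED, 3. OUTLIERS, 4. UNEQUAL VARIANCES]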
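The quantities drawn in a boxplot can be computed directly from a sample. The sketch below makes two assumptions that the slides do not state: hinges are taken as the medians of the lower and upper halves of the sorted data, and a point counts as an outlier when it lies more than 1.5 times the spread beyond a hinge (a common convention). The data are hypothetical.

```python
from statistics import median

def boxplot_summary(values, fence_multiplier=1.5):
    """Median, hinges, spread, extreme non-outlying values and outliers."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # both halves include the median when n is odd
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])
    spread = upper_hinge - lower_hinge
    # Assumed convention: beyond 1.5 x spread from a hinge is an outlier.
    low_fence = lower_hinge - fence_multiplier * spread
    high_fence = upper_hinge + fence_multiplier * spread
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median(data),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "spread": spread,
        "smallest": inside[0],
        "largest": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical sample with one extreme value.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])
for name, value in summary.items():
    print(f"{name:12s} {value}")
```

Comparing these summaries across groups gives a numerical counterpart to the four patterns above: a median off-centre between the hinges suggests skewness, and very different spreads suggest unequal variances.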
Boxplots of Cellana numbers in survey 5
[Figure: boxplots of number of Cellana per quadrat (0 to 100) for each site: S, FPE, RR, SP, CPE, CB, LB, CPW]
Scatterplots
• Plotting bivariate data
• Value of two variables recorded for each
observation
• Each variable plotted on one axis (x or y)
• Symbols represent each observation
• Assess relationship between two variables
Cheviot Beach survey 5, n = 15
[Figure: scatterplot of number of Cellana per quadrat (0 to 40) against % cover of Hormosira per quadrat (0 to 70)]
Scatterplot matrix
• Abbreviated to SPLOM
• Extension of scatterplot
• For plotting relationships between 3 or
more variables on one plot
• Bivariate plots in multiple panels on
SPLOM
SPLOM for Cheviot Beach survey 5
CELLANA - numbers of Cellana
SIPHALL - numbers of Siphonaria
HORMOS - % cover of Hormosira
n = 15 quadrats
Transformations
• Improve normality.
• Remove relationship between mean and
variance.
• Make variances more similar in different
populations.
• Reduce influence of outliers.
• Make relationships between variables more
linear (regression analysis).
Log transformation
Lognormal → Normal
$y' = \log(y)$
Measurement data
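A minimal sketch of the log transformation applied to right-skewed measurement data. Base 10 is an assumption here (the slide does not specify a base), and the lengths are hypothetical. In a skewed sample the mean sits well above the median; after transformation the two should be closer.

```python
import math
from statistics import mean, median

def log10_transform(values):
    """Return log10(y) for each strictly positive y."""
    if any(v <= 0 for v in values):
        raise ValueError("log transformation needs strictly positive values")
    return [math.log10(v) for v in values]

# Hypothetical right-skewed measurements (e.g. lengths in mm).
lengths = [2, 3, 4, 5, 6, 8, 12, 20, 45]
logged = log10_transform(lengths)

print(f"raw:    mean={mean(lengths):.2f}  median={median(lengths):.2f}")
print(f"logged: mean={mean(logged):.2f}  median={median(logged):.2f}")
```

Zeros cannot be log-transformed, which is why the function rejects them rather than silently adding a constant.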
Power transformation
Poisson → Normal
$y' = \sqrt{y}$, i.e. $y' = y^{0.5}$, $y' = y^{0.25}$
Count data
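A short sketch of the power transformation on count data, showing how the square root can bring the variances of two groups closer together when the variance increases with the mean. Both groups of counts are hypothetical.

```python
from statistics import variance

def power_transform(values, exponent=0.5):
    """Return y ** exponent for non-negative y (0.5 = square root, 0.25 = fourth root)."""
    if any(v < 0 for v in values):
        raise ValueError("power transformation needs non-negative values")
    return [v ** exponent for v in values]

# Hypothetical counts from a sparse group and a dense group.
low_counts = [0, 1, 1, 2, 3, 2, 1]
high_counts = [12, 20, 15, 25, 18, 30, 22]

ratio_raw = variance(high_counts) / variance(low_counts)
ratio_sqrt = (variance(power_transform(high_counts))
              / variance(power_transform(low_counts)))

print(f"variance ratio, raw counts:     {ratio_raw:.1f}")
print(f"variance ratio, square roots:   {ratio_sqrt:.1f}")
```

A ratio closer to 1 after transformation indicates that the group variances have become more similar.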
Arcsin transformation
Square → Normal
$y' = \sin^{-1}(\sqrt{y})$
Proportions and percentages
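The arcsin transformation expects proportions between 0 and 1, so percentages need dividing by 100 first. A minimal sketch, with hypothetical percent-cover values and the result expressed in radians:

```python
import math

def arcsin_sqrt(proportions):
    """Return asin(sqrt(p)) in radians for each proportion p in [0, 1]."""
    if any(p < 0 or p > 1 for p in proportions):
        raise ValueError("proportions must lie between 0 and 1")
    return [math.asin(math.sqrt(p)) for p in proportions]

# Hypothetical % cover values, converted to proportions before transforming.
percent_cover = [0, 5, 20, 50, 80, 95, 100]
transformed = arcsin_sqrt([p / 100 for p in percent_cover])

for pct, t in zip(percent_cover, transformed):
    print(f"{pct:3d}%  ->  {t:.3f} rad")
```

The transformation stretches values near 0 and 1 relative to those near 0.5, which is where proportions are most compressed.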
Outliers
• Observations very different from rest of
sample - identified in boxplots.
• Check if mistakes (e.g. typos, broken
measuring device) - if so, omit.
• Extreme values in skewed distribution - transform.
• Alternatively, do analysis twice - outliers in
and outliers excluded. Worry if influential.
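Running the analysis twice can be as simple as computing the same summary with and without the suspected values and comparing the results. The sketch below uses the mean and standard deviation as a stand-in for "the analysis"; the data and the function name are hypothetical.

```python
from statistics import mean, stdev

def compare_with_and_without(values, suspected_outliers):
    """Return (mean, sd) for the full sample and for the sample minus suspects."""
    kept = [v for v in values if v not in suspected_outliers]
    return {
        "all": (mean(values), stdev(values)),
        "outliers_excluded": (mean(kept), stdev(kept)),
    }

# Hypothetical sample in which 40 was flagged in a boxplot.
result = compare_with_and_without([1, 2, 3, 4, 5, 6, 7, 8, 9, 40], {40})
for label, (m, s) in result.items():
    print(f"{label:18s} mean={m:.2f}  sd={s:.2f}")
```

If the two sets of results lead to different conclusions, the outliers are influential and the choice to keep or drop them needs to be justified.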
Assumptions not met?
• Check and deal with outliers
• Transformation
– might fix non-normality and unequal variances
• Nonparametric rank test
– does not assume normality
– does assume similar variances
– Mann-Whitney-Wilcoxon
– only suitable for simple analyses
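To show what a rank test works with, here is a minimal sketch of the Mann-Whitney U statistic for two groups, computed from ranks with ties given their average rank. It stops at the statistic and does not compute a P value; the function names and data are illustrative only.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) from the rank sum of group_a in the pooled sample."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b

# Hypothetical counts from two sites.
u_a, u_b = mann_whitney_u([3, 5, 8, 12], [10, 15, 20, 41])
print(f"U_a = {u_a}, U_b = {u_b}")
```

Because only the ranks enter the calculation, a single extreme count has no more influence than any other largest value, which is why the test does not need normality.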
Category or line plot
[Figures: mean number of Cellana per quadrat (0 to 30) plotted against survey (1 to 10); legend: Cheviot Beach, Sorrento]