Data Analysis Dr. Doug McLaughlin November 18, 2013

The goal of this lecture is to encourage deeper thinking about data or values we see or derive ourselves.
What’s In A Number?
• “The concentration of dioxin in fish is 0.6 parts
per trillion.”
• Important questions to ask:
– About the objective (why do we care?)
– About the value (what does this number represent?)
– About the data (what else should we know about the data set it comes from? Is the data quality acceptable?)
Some Good Questions To Ask About the Value
• What does “0.6” represent?
– Mean? Median? Geometric mean?
– From a representative “sample”?
– How sure are we (confidence limits)?
– Is it changing (trends over time)?
– Data assumptions (e.g., not skewed, no values below detection limits)?
Concentration Units

Concentration   SI Prefix Name   “Parts per…”   Factor (Decimal Notation)   Factor (Scientific Notation, g/L)
1 g/L           -                thousand       1                           10^0
1 mg/L          milli            million        0.001                       10^-3
1 ug/L          micro            billion        0.000001                    10^-6
1 ng/L          nano             trillion       0.000000001                 10^-9
1 pg/L          pico             quadrillion    0.000000000001              10^-12
1 fg/L          femto            quintillion    0.000000000000001           10^-15
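The factors in the units table can be turned into a small conversion helper. The sketch below is illustrative only: the dictionary and the function name `convert` are not from the lecture, and the "parts per" equivalences in the table hold only when one liter of the sample weighs about one kilogram (dilute water samples).

```python
# Factors from the table: how many g/L one unit of each concentration represents.
GRAMS_PER_LITER = {
    "g/L": 1e0,
    "mg/L": 1e-3,
    "ug/L": 1e-6,
    "ng/L": 1e-9,
    "pg/L": 1e-12,
    "fg/L": 1e-15,
}


def convert(value, from_unit, to_unit):
    """Convert a concentration between two of the units in the table."""
    return value * GRAMS_PER_LITER[from_unit] / GRAMS_PER_LITER[to_unit]


# 0.6 ng/L expressed in pg/L (three decimal places smaller unit)
# and in mg/L (six decimal places larger unit).
print(convert(0.6, "ng/L", "pg/L"))
print(convert(0.6, "ng/L", "mg/L"))
```

Going through g/L as a common base keeps the helper to one multiplication and one division, whichever pair of units is involved.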
USEPA Data Quality Objectives Process:
A 7-Step Framework for Data Collection, Data Analysis, and Decision-Making
• How hazardous is consuming fish from a river?
• Compare chemical concentrations in fish tissue with health guidance.
• Determine chemical concentrations in representative fish tissue.
USEPA Data Quality Objectives Process:
A Framework for Data Collection, Data Analysis, and Decision-Making (continued)
• Most commonly caught fish species from River x.
• Measure contaminant concentrations in whole body tissue samples. Compare to health guidance values.
USEPA Data Quality Objectives Process:
A Framework for Data Collection, Data Analysis, and Decision-Making
• How certain must the estimated mean be, i.e., how small must the confidence interval on the estimated mean be?
• How many samples are needed? What analytical method should be used?
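The link between "how small must the confidence interval be" and "how many samples are needed" can be sketched with the usual normal-approximation formula n = (z * s / E)^2, where s is an assumed standard deviation and E is the desired half-width of the interval. This is a planning sketch only, not part of the USEPA process text: the function name is illustrative, the 0.54 ppt value is the standard deviation reported for the example data set in this lecture, and the 0.2 ppt target half-width is a hypothetical choice.

```python
import math
from statistics import NormalDist


def samples_needed(sd, half_width, confidence=0.95):
    """Approximate sample size so that the confidence interval on the mean
    is no wider than +/- half_width (normal approximation, known sd)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return math.ceil((z * sd / half_width) ** 2)


# Assumed inputs: sd = 0.54 ppt, target half-width = 0.2 ppt (hypothetical).
print(samples_needed(0.54, 0.2))
```

Because the half-width enters squared, halving the acceptable interval width roughly quadruples the number of samples. For small sample sizes a t-based calculation would give a somewhat larger answer.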
Example Data Set
• 37 “dioxin” concentration measurements from fish collected downstream of a pulp and paper mill

Year   Dioxin (ppt)     Year   Dioxin (ppt)
0      1.8              10     <0.25
0      1.7              10     0.79
0      2.6              10     <0.39
0      0.84             10     <0.23
0      1.1              10     <0.25
3      1.4              10     0.37
3      0.26             10     0.37
3      1.1              10     0.27
7      0.63             10     0.30
7      0.4              10     0.26
7      0.28             13     0.12
7      0.79             13     0.24
7      0.4              13     0.24
7      0.31             13     0.29
7      0.2              13     0.36
7      0.51             13     0.38
7      0.44             13     0.33
                        13     0.24
                        13     0.38
                        13     <0.12
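One quick way to ask "is it changing?" of this data set is to summarize the measurements by year. The sketch below is illustrative: it assumes the Year column pairs with the concentrations in the order listed, and it enters each "less than" result at its detection limit, which is only one of several possible assumptions. The function name is not from the lecture.

```python
from collections import defaultdict
from statistics import mean, median

# Year and dioxin (ppt) columns in the order listed; "<" values entered at the DL.
YEARS = [0] * 5 + [3] * 3 + [7] * 9 + [10] * 10 + [13] * 10
DIOXIN = [
    1.8, 1.7, 2.6, 0.84, 1.1,                                  # year 0
    1.4, 0.26, 1.1,                                            # year 3
    0.63, 0.4, 0.28, 0.79, 0.4, 0.31, 0.2, 0.51, 0.44,         # year 7
    0.25, 0.79, 0.39, 0.23, 0.25, 0.37, 0.37, 0.27, 0.30, 0.26,  # year 10
    0.12, 0.24, 0.24, 0.29, 0.36, 0.38, 0.33, 0.24, 0.38, 0.12,  # year 13
]


def summarize_by_year(years, values):
    """Return {year: (n, mean, median)} for the paired columns."""
    groups = defaultdict(list)
    for year, value in zip(years, values):
        groups[year].append(value)
    return {y: (len(v), mean(v), median(v)) for y, v in sorted(groups.items())}


for year, (n, avg, med) in summarize_by_year(YEARS, DIOXIN).items():
    print(f"year {year:>2}: n={n:>2}  mean={avg:.2f}  median={med:.2f}")
```

Under these assumptions the yearly means fall steadily from the first sampling year to the last. A grouped summary like this is a screening step, not a formal trend test.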
Data Summary Statistics
Assumes “less thans” are equal to the detection limit

Parameter                Value
N                        37
Mean                     0.57
Variance                 0.29
S.D.                     0.54
Coef. Var. (S.D./Mean)   95%
Median                   0.37
25th percentile          0.26
75th percentile          0.71
Min.                     0.12
Max.                     2.6
Example Data Set
Assumes “less thans” are equal to the detection limit
Understanding Data Distributions
Examples of Normal and Lognormal Distributions
[Figure: two distribution plots of density versus X. One panel: Normal, Mean=0.57, StDev=0.29. Other panel: Lognormal, Loc=0.57, Scale=0.57, Thresh=0.]
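The difference between the two distribution shapes can be explored numerically. The sketch below assumes that "Loc" and "Scale" in the lognormal panel are the mean and standard deviation of ln(X), and that the threshold is 0; that reading of the plot labels is an assumption, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def lognormal_pdf(x, loc, scale):
    """Density of X when ln(X) is normal with mean `loc` and s.d. `scale`."""
    if x <= 0:
        return 0.0
    return NormalDist(loc, scale).pdf(math.log(x)) / x


def lognormal_median(loc):
    return math.exp(loc)


def lognormal_mean(loc, scale):
    return math.exp(loc + scale ** 2 / 2)


normal = NormalDist(0.57, 0.29)
# A normal distribution is symmetric: mean and median coincide.
print("normal mean = median =", normal.mean)
# A lognormal distribution is right-skewed: the mean sits above the median.
print("lognormal median =", round(lognormal_median(0.57), 2))
print("lognormal mean   =", round(lognormal_mean(0.57, 0.57), 2))
print("lognormal density at x = 1:", round(lognormal_pdf(1.0, 0.57, 0.57), 3))
```

A mean noticeably larger than the median, as in the example dioxin data, is one sign that a right-skewed distribution may describe the data better than a normal one.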
Making Assumptions About “Censored” Values (“Nondetects”)
• Assume/substitute specific values for NDs
– 0, ½ detection limit, full detection limit are common substitutions
• Convenient, but can lead to incorrect conclusions in certain cases.
– Hard to predict when problems will arise
• Are there alternatives? Yes. One example is the Kaplan-Meier procedure for estimating a mean.
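A rough sketch of how a Kaplan-Meier mean can be computed for left-censored data follows. It is one common convention, not necessarily the one used for this lecture: the estimate of P(X <= x) is stepped down at each detected value, nondetects count as "at or below" any value that is at least their detection limit, and any probability left below the smallest detect is placed at the smallest value in the data set. The function name and the data layout are illustrative, and statistical packages may handle ties and the lowest value differently.

```python
# 32 detected dioxin values (ppt) followed by the 5 detection limits.
DETECTS = [
    1.8, 1.7, 2.6, 0.84, 1.1, 1.4, 0.26, 1.1, 0.63, 0.4, 0.28, 0.79, 0.4,
    0.31, 0.2, 0.51, 0.44, 0.79, 0.37, 0.37, 0.27, 0.30, 0.26, 0.12, 0.24,
    0.24, 0.29, 0.36, 0.38, 0.33, 0.24, 0.38,
]
DETECTION_LIMITS = [0.25, 0.39, 0.23, 0.25, 0.12]

VALUES = DETECTS + DETECTION_LIMITS
CENSORED = [False] * len(DETECTS) + [True] * len(DETECTION_LIMITS)


def kaplan_meier_mean(values, censored):
    """Kaplan-Meier estimate of the mean for left-censored data.

    values[i] is the measurement, or the detection limit if censored[i].
    """
    detected = sorted({v for v, c in zip(values, censored) if not c}, reverse=True)
    cdf = 1.0   # estimated P(X <= x), starting above the largest value
    total = 0.0
    for x in detected:
        # Everything known to be at or below x: detects <= x and NDs with DL <= x.
        at_or_below = sum(1 for v in values if v <= x)
        hits = sum(1 for v, c in zip(values, censored) if not c and v == x)
        cdf_below = cdf * (1 - hits / at_or_below)
        total += x * (cdf - cdf_below)   # probability mass assigned to x
        cdf = cdf_below
    # Convention assumed here: leftover mass goes to the smallest value present.
    return total + cdf * min(values)


print(round(kaplan_meier_mean(VALUES, CENSORED), 3))
```

With no nondetects the function reduces to the ordinary arithmetic mean. With nondetects, the result under this convention cannot exceed the mean obtained by substituting the full detection limit, and cannot fall below the mean obtained by substituting zero.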
Effect of “Less Than” Substitution Assumption
DL = detection limit; S.D. = standard deviation; Coef. Var. = coefficient of variation

Parameter                ND=DL   ND=0    ND=1/2 DL
N                        37      37      37
Mean                     0.57    0.53    0.55
Variance                 0.29    0.32    0.30
S.D.                     0.54    0.56    0.55
Coef. Var. (S.D./Mean)   95%     106%    100%
Median                   0.37    0.36    0.36
25th percentile          0.26    0.24    0.24
75th percentile          0.71    0.71    0.71
Min.                     0.12    0       0.06
Max.                     2.6     2.6     2.6
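The three substitution columns can be recomputed from the example data set with Python's standard library. This is a sketch, not the tool used to build the table: the function name is illustrative, and the quartiles use the default "exclusive" method of `statistics.quantiles`, which is an assumption about how the percentiles were calculated.

```python
from statistics import mean, median, quantiles, stdev, variance

# 32 detected dioxin values (ppt) and the 5 detection limits of the nondetects.
DETECTS = [
    1.8, 1.7, 2.6, 0.84, 1.1, 1.4, 0.26, 1.1, 0.63, 0.4, 0.28, 0.79, 0.4,
    0.31, 0.2, 0.51, 0.44, 0.79, 0.37, 0.37, 0.27, 0.30, 0.26, 0.12, 0.24,
    0.24, 0.29, 0.36, 0.38, 0.33, 0.24, 0.38,
]
DETECTION_LIMITS = [0.25, 0.39, 0.23, 0.25, 0.12]


def summarize(fraction_of_dl):
    """Summary statistics with each nondetect set to fraction_of_dl * DL."""
    data = DETECTS + [fraction_of_dl * dl for dl in DETECTION_LIMITS]
    q1, _, q3 = quantiles(data, n=4)  # 25th and 75th percentiles
    return {
        "n": len(data),
        "mean": mean(data),
        "variance": variance(data),   # sample variance (n - 1 denominator)
        "sd": stdev(data),
        "cv": stdev(data) / mean(data),
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "min": min(data),
        "max": max(data),
    }


for label, fraction in [("ND=DL", 1.0), ("ND=0", 0.0), ("ND=1/2 DL", 0.5)]:
    s = summarize(fraction)
    print(f"{label:>9}: mean={s['mean']:.2f}  variance={s['variance']:.2f}  "
          f"median={s['median']:.2f}  min={s['min']:.2f}")
```

With only 5 nondetects out of 37 values, the choice of substitution moves the mean by a few hundredths of a ppt here. The effect can be much larger when a greater share of the data is censored.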
Example Data Set
• 37 “dioxin” concentration measurements from fish collected downstream of a pulp and paper mill

Sample No.   Dioxin (ppt)     Sample No.   Dioxin (ppt)
1            1.8              18           <0.25
2            1.7              19           0.79
3            2.6              20           <0.39
4            0.84             21           <0.23
5            1.1              22           <0.25
6            1.4              23           0.37
7            0.26             24           0.37
8            1.1              25           0.27
9            0.63             26           0.30
10           0.4              27           0.26
11           0.28             28           0.12
12           0.79             29           0.24
13           0.4              30           0.24
14           0.31             31           0.29
15           0.2              32           0.36
16           0.51             33           0.38
17           0.44             34           0.33
                              35           0.24
                              36           0.38
                              37           <0.12