QA_QC Training Module.Oct2003.final

Quality Assurance &
Quality Control
University of New Mexico




Kristin Vanderbilt
William Michener
James Brunt
Troy Maddux
University of South Carolina

Don Edwards
Structure of talk:
• Types of errors
• Define QA/QC
• Procedures for reducing and detecting data contamination
    • Designing data sheets
    • Validation rules, filters, lookup tables, control charts
    • Graphics and statistics
    • Outlier detection
        • Samples
        • Simple linear regression
Instruction Materials

Primary References
Michener and Brunt (2000) Ecological Data: Design, Management and Processing. Blackwell Science.
    Edwards (2000) Ch. 4
    Brunt (2000) Ch. 2
    Michener (2000) Ch. 7
Dux, J.P. (1986) Handbook of Quality Assurance for the Analytical Chemistry Laboratory. Van Nostrand Reinhold Company.
QA/QC

“mechanisms [that] are designed to prevent
the introduction of errors into a data set, a
process known as data contamination”
Brunt 2000
Errors (2 types)


Commission
Omission
Brunt 2000
Errors of Commission
Incorrect or inaccurate data in a dataset
Common
Can be (relatively) easy to find




Malfunctioning instrumentation





Sensor drift
Low batteries
Damage
Animal mischief
Data entry errors
Identifying Sensor Errors: Comparison of data
from three Met stations, Sevilleta LTER
[Figure, two slides: data from the three Met stations compared.]
Errors of Omission
Common
Difficult or impossible to find



Inadequate documentation of data
values, sampling methods, anomalies in
field, human errors
Quality Control

“mechanisms that are applied in advance,
with a priori knowledge to ‘control’ data
quality during the data acquisition process”
Brunt 2000
Quality Assurance

“mechanisms [that] can be applied after the
data have been collected, entered in a
computer and analyzed to identify errors of
omission and commission”


graphics
statistics
Brunt 2000
QA/QC Activities



Defining and enforcing standards for formats,
codes, measurement units and metadata.
Checking for unusual or unreasonable
patterns in data.
Checking for comparability of values between
data sets.
Brunt 2000
The progression from raw data to verified data to
validated data (reviewed by a qualified
scientist) is called data maturity; it
implies increasing confidence in the
quality of the data through time.
QA/QC should be
considered at all stages
of the research
process……
•Study Design
•Acquisition/Data Entry
•Data documentation
Flowering Plant Phenology – QA/QC
and study design example




What data are we collecting?
How often are we collecting?
Where are we collecting?
What is the experimental design?
Protocols are developed:



Description of the project design
(where, how, when data are collected)
Instructions for collecting the data
Sample data sheet
The plant phenology study has:



Four Sites
Three sites have three transects, one
site has two transects
Every species encountered will have
phenological stage recorded
Data Collection Form Development:
What’s wrong with this data sheet?
Plant             Life Stage
______________    _______________
______________    _______________
______________    _______________
______________    _______________
______________    _______________
______________    _______________
______________    _______________
______________    _______________
PHENOLOGY DATA SHEET
Collectors:_________________________________
Date:___________________ Time:_________
Location: black butte, deep well, five points, goat draw
Notes: _________________________________________
_______________________________________________
Plant             Life Stage
______________    _______________
______________    _______________
______________    _______________
______________    _______________
______________    _______________
______________    _______________
______________    _______________
______________    _______________
PHENOLOGY DATA SHEET
Collectors:_________________________________
Date:___________________ Time:_________
Location: black butte, deep well, five points, rio salado
Notes: _________________________________________
_______________________________________________
Plant    Life Stage
ardi     P/G   V   B   FL   FR   M   S   D   NP
arpu     P/G   V   B   FL   FR   M   S   D   NP
atca     P/G   V   B   FL   FR   M   S   D   NP
bamu     P/G   V   B   FL   FR   M   S   D   NP
zigr     P/G   V   B   FL   FR   M   S   D   NP
____     P/G   V   B   FL   FR   M   S   D   NP
____     P/G   V   B   FL   FR   M   S   D   NP
PHENOLOGY DATA SHEET
Collectors:_________________________________
Date:___________________ Time:_________
Location: black butte, deep well, five points, goat draw
Notes: _________________________________________
_______________________________________________
Plant    Life Stage
ardi     P/G   V   B   FL   FR   M   S   D   NP
arpu     P/G   V   B   FL   FR   M   S   D   NP
atca     P/G   V   B   FL   FR   M   S   D   NP
bamu     P/G   V   B   FL   FR   M   S   D   NP
zigr     P/G   V   B   FL   FR   M   S   D   NP
____     P/G   V   B   FL   FR   M   S   D   NP
____     P/G   V   B   FL   FR   M   S   D   NP
P/G = perennating or germinating    M = dispersing
V = vegetating                      S = senescing
B = budding                         D = dead
FL = flowering                      NP = not present
FR = fruiting
PHENOLOGY DATA SHEET
Rio Salado - Transect 1
Collectors:_________________________________
Date:___________________ Time:_________
Location: black butte, deep well, five points, goat draw
Notes: _________________________________________
_______________________________________________
Plant    Life Stage
ardi     P/G   V   B   FL   FR   M   S   D   NP
arpu     P/G   V   B   FL   FR   M   S   D   NP
atca     P/G   V   B   FL   FR   M   S   D   NP
bamu     P/G   V   B   FL   FR   M   S   D   NP
zigr     P/G   V   B   FL   FR   M   S   D   NP
____     P/G   V   B   FL   FR   M   S   D   NP
____     P/G   V   B   FL   FR   M   S   D   NP
P/G = perennating or germinating    M = dispersing
V = vegetating                      S = senescing
B = budding                         D = dead
FL = flowering                      NP = not present
FR = fruiting
QA/QC and Data
Acquisition:
Train technicians well!!
Data entry:

Use data entry programs that have


Validation rules or constraints on the range or
values that can be entered
Lookup Tables
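A minimal sketch of how validation rules and lookup tables might work at data entry, using the phenology example. The stage codes and species codes are taken from the sample data sheet; the function name, the record layout, and the transect range of 1 to 3 are assumptions made for illustration only.

```python
# Lookup tables: the only values a technician is allowed to enter.
# Codes are copied from the sample phenology data sheet.
STAGE_CODES = {"P/G", "V", "B", "FL", "FR", "M", "S", "D", "NP"}
SPECIES_CODES = {"ardi", "arpu", "atca", "bamu", "zigr"}


def validate_record(record):
    """Return a list of problems found in one record (empty if none).

    `record` is a hypothetical dict with keys 'species', 'stage', 'transect'.
    """
    problems = []
    if record.get("species") not in SPECIES_CODES:
        problems.append(f"unknown species code: {record.get('species')!r}")
    if record.get("stage") not in STAGE_CODES:
        problems.append(f"unknown life stage code: {record.get('stage')!r}")
    # Validation rule: assume no site has more than three transects.
    transect = record.get("transect")
    if not isinstance(transect, int) or not 1 <= transect <= 3:
        problems.append(f"transect out of range: {transect!r}")
    return problems


if __name__ == "__main__":
    print(validate_record({"species": "ardi", "stage": "FL", "transect": 1}))
    print(validate_record({"species": "ardy", "stage": "FL", "transect": 5}))
```

Rejecting a bad code at entry time is cheaper than finding it later, which is why the lookup happens before the record is stored.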
Other methods for preventing data
contamination


Double-keying of data by independent
data entry technicians followed by
computer verification for agreement
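The computer verification step of double-keying can be sketched as a cell-by-cell comparison of the two independently keyed files. This is an illustrative sketch only: the function name and the list-of-rows layout are assumptions, and it assumes both keyings contain the same number of rows and columns.

```python
def compare_keyings(first, second):
    """Compare two independent keyings of the same data sheet.

    Each argument is a list of rows, and each row is a list of strings.
    Returns (row, column, first_value, second_value) tuples, 1-based,
    for every cell where the two keyings disagree.
    """
    mismatches = []
    for r, (row_a, row_b) in enumerate(zip(first, second), start=1):
        for c, (a, b) in enumerate(zip(row_a, row_b), start=1):
            # Ignore stray leading or trailing blanks when comparing.
            if a.strip() != b.strip():
                mismatches.append((r, c, a, b))
    return mismatches


if __name__ == "__main__":
    keyed_by_tech_1 = [["ardi", "FL"], ["arpu", "V"]]
    keyed_by_tech_2 = [["ardi", "FL"], ["arpu", "B"]]
    for row, col, a, b in compare_keyings(keyed_by_tech_1, keyed_by_tech_2):
        print(f"row {row}, column {col}: {a!r} vs {b!r}")
```

Each reported cell is then checked against the paper data sheet to decide which keying is correct.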
Filters for “illegal” data

Computer/database programs


Legal range of values
Sanity checks
Edwards 2000
Flow of Information in Filtering “Illegal” Data
Raw
Data
File
Table of
Possible
Values
and Ranges
Illegal
Data
Filter
Report of
Probable
Errors
Edwards 2000
Illegal-data filter in SAS
    /* Write one report line for every failed range or consistency check. */
    data checkum;
      set all;
      message = repeat(" ", 39);
      if Y1 < 0 or Y1 > 1 then do;
        message = "Y1 is not on the interval [0,1]"; output;
      end;
      if Y3 > Y4 then do;
        message = "Y3 is larger than Y4"; output;
      end;
      /* ... (add as many checks as desired) */
      if message ne repeat(" ", 39);
      keep ID message;
    run;

    proc print data=checkum;
    run;
QC in the Lab
Laboratory quality control using
statistical process control charts:

Determine whether analytical system is
“in control” by examining
• Mean
• Variability (range)
All control charts have three basic
components:



a centerline, usually the mathematical
average of all the samples plotted.
upper and lower statistical control limits
that define the constraints of common
cause variations.
performance data plotted over time.
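X-Bar Control Chart
[Figure: check-standard values plotted against time, with horizontal lines for the mean, the upper and lower warning limits (UWL, LWL), and the upper and lower control limits (UCL, LCL).]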
Constructing an X-Bar control
chart



Each point represents a check standard
run with each group of 20 samples, for
example
UWL = mean + 2* standard deviation
UCL = mean + 3*standard deviation
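As a rough illustration, the limits can be computed from a series of check-standard results as below. The slide gives only the upper limits; the lower limits (LWL, LCL) are assumed here to be symmetric about the mean, and the function name and the sample values are made up for the example.

```python
from statistics import mean, stdev


def xbar_limits(check_standards):
    """Return the centerline, warning limits and control limits."""
    center = mean(check_standards)
    sd = stdev(check_standards)  # sample standard deviation
    return {
        "LCL": center - 3 * sd,  # assumed symmetric with UCL
        "LWL": center - 2 * sd,  # assumed symmetric with UWL
        "mean": center,
        "UWL": center + 2 * sd,
        "UCL": center + 3 * sd,
    }


if __name__ == "__main__":
    # Hypothetical check-standard results, one per batch of samples.
    results = [21.5, 22.3, 21.9, 22.8, 21.1, 22.0, 22.4, 21.7]
    for name, value in xbar_limits(results).items():
        print(f"{name}: {value:.2f}")
```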
Things to look for in a control chart:
The point of making control charts is to look
at variation, seeking patterns or statistically
unusual values. Look for:




1 data point falling outside the control limits
6 or more points in a row steadily increasing or
decreasing
8 or more points in a row on one side of the
centerline
14 or more points alternating up and down
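The four patterns listed above can be checked mechanically. The sketch below is one possible reading of those rules, not a standard library routine; the function names are invented, and ties (equal consecutive values, or values exactly on the centerline) are treated as breaking a run, which is an assumption.

```python
def longest_run(flags):
    """Return the length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def control_chart_signals(values, center, lcl, ucl):
    """Return descriptions of the warning patterns found in a series."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    flips = [d1 * d2 < 0 for d1, d2 in zip(diffs, diffs[1:])]
    signals = []
    if any(v < lcl or v > ucl for v in values):
        signals.append("point outside the control limits")
    # k steadily rising points produce k - 1 positive differences.
    trend = max(longest_run([d > 0 for d in diffs]),
                longest_run([d < 0 for d in diffs]))
    if trend >= 5:
        signals.append("6 or more points steadily increasing or decreasing")
    side = max(longest_run([v > center for v in values]),
               longest_run([v < center for v in values]))
    if side >= 8:
        signals.append("8 or more points on one side of the centerline")
    # k alternating points produce k - 2 sign changes between differences.
    if longest_run(flips) >= 12:
        signals.append("14 or more points alternating up and down")
    return signals


if __name__ == "__main__":
    print(control_chart_signals([20, 21, 22, 23, 24, 25], 22, 16, 28))
```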
Linear trend
[Figure: control chart values plotted against time, showing a linear trend.]
Range control charts – use when check
standards are unavailable.

To construct an R chart:


Run 15-20 samples in duplicate, calculate the
absolute value of the range between
duplicate readings, and calculate the mean of
the ranges (R).
About 50% of ranges should fall below the line
corresponding to 0.845R, 95% below the line at
2.456R, and 99% below the line at 3.27R.
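A short sketch of the R chart calculation, using the multipliers 0.845, 2.456 and 3.27 quoted above. The function name and the duplicate readings are hypothetical.

```python
def range_chart_lines(duplicates):
    """Compute duplicate ranges and the R chart reference lines.

    `duplicates` is a list of (first_reading, second_reading) pairs.
    """
    ranges = [abs(a - b) for a, b in duplicates]
    r_bar = sum(ranges) / len(ranges)  # mean of the ranges (R)
    lines = {
        "mean range": r_bar,
        "50%": 0.845 * r_bar,
        "95%": 2.456 * r_bar,
        "99%": 3.27 * r_bar,
    }
    return ranges, lines


if __name__ == "__main__":
    # Hypothetical duplicate readings; a real chart would use 15-20 pairs.
    pairs = [(10.0, 10.4), (9.8, 10.0), (10.1, 10.1), (10.3, 9.9)]
    ranges, lines = range_chart_lines(pairs)
    for name, value in lines.items():
        print(f"{name}: {value:.3f}")
```

New duplicate ranges are then plotted over time against these lines in the same way as points on an X-bar chart.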
Range Control Chart
[Figure: ranges between duplicate readings plotted against time.]
QA/QC and metadata




Check data table attribute names—are
they reasonable?
Are variables defined?
Are missing values coded?
Are metadata complete?
Quality Assurance:

Graphical and statistical methods can be
used to detect:


Outliers
Influential points in regression analysis
Outliers


An outlier is “an unusually extreme value for
a variable, given the statistical model in use”
The goal of QA is NOT to eliminate outliers!
Rather, we wish to detect unusually extreme
values.
Edwards 2000
Outlier Detection

“the detection of outliers is an intermediate
step in the elimination of [data]
contamination”


Attempt to determine if contamination is
responsible and, if so, flag the contaminated
value.
If not, formally analyse with and without
“outlier(s)” and see if results differ.
Edwards 2000
Methods for Detecting Outliers

Graphics





Scatter plots
Box plots
Histograms
Normal probability plots
Formal statistical methods


Kolmogorov-Smirnov test of normality
Grubbs’ test
Edwards 2000
X-Y scatter plots
of gopher tortoise
morphometrics
Michener 2000
Example of
exploratory data
analysis using SAS
Insight®
Michener 2000
Box Plot Interpretation
From top to bottom, a box plot shows:
• Pts. > upper adjacent value
• Upper adjacent value
• Upper quartile
• Median
• Lower quartile
• Lower adjacent value
• Pts. < lower adjacent value
The box spans the interquartile range, from the lower quartile to the upper quartile.
IQR = Q(75%) - Q(25%)
Upper adjacent value = largest observation < (Q(75%) + (1.5 x IQR))
Lower adjacent value = smallest observation > (Q(25%) - (1.5 x IQR))
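The adjacent values defined above can be computed directly. This is a sketch under one assumption worth flagging: quartiles can be defined in several ways, and the "inclusive" method of Python's `statistics.quantiles` is used here, which may differ slightly from the definition used by SAS or other packages. The function name and data are invented for the example.

```python
from statistics import quantiles


def adjacent_values(data):
    """Return (lower adjacent, upper adjacent, points outside them)."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    lower_fence = q1 - 1.5 * iqr
    upper_adjacent = max(v for v in data if v < upper_fence)
    lower_adjacent = min(v for v in data if v > lower_fence)
    # Points beyond the adjacent values are candidates for closer inspection.
    outside = sorted(v for v in data
                     if v > upper_adjacent or v < lower_adjacent)
    return lower_adjacent, upper_adjacent, outside


if __name__ == "__main__":
    print(adjacent_values([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```

Points reported as outside the adjacent values are flagged for checking, not deleted.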
Box Plots Depicting Statistical Distribution of
Soil Temperature
Statistical tests for outliers assume that
the data are normally distributed….
CHECK THIS ASSUMPTION!
Normal density and Cumulative Distribution Functions
[Figure: two panels, each plotted against Y from -3 to 3.]
Edwards 2000
Normal Plot of 30 Observations from a Normal Distribution
[Figure: Y plotted against quantiles of standard normal.]
Edwards 2000
Normal Plots from Non-normally Distributed Data
[Figure: several panels, each pairing a non-normal distribution of Y with its normal plot against quantiles of standard normal.]
Edwards 2000
Grubbs' test is a statistical
test to determine whether a point is
an outlier.
Grubbs' test for outlier detection:
Tn = (Yn - Ybar)/S
where Yn is the possible outlier,
Ybar is the mean of the sample, and
S is the standard deviation of the sample
Contamination exists if Tn is greater than T.01[n], the critical value at the 0.01 significance level for a sample of size n
Example of Grubbs' test for outliers:
rainfall in acre-feet from seeded clouds
(Simpson et al. 1975)
   4.1     7.7    17.5    31.4    32.7    40.6    92.4
 115.3   118.3   119.0   129.6   198.6   200.7   242.5
 255.0   274.7   274.7   302.8   334.1   430.0   489.1
 703.4   978.0  1656.0  1697.8  2745.6
T26 = 3.539 > 3.029; Contaminated
Edwards 2000
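The T26 value in this example can be reproduced from the 26 rainfall values. The sketch below uses the critical value 3.029 exactly as quoted on the slide instead of computing it; the variable and function names are made up for illustration.

```python
from statistics import mean, stdev

# Rainfall (acre-feet) from seeded clouds, as listed in the example.
SEEDED_RAINFALL = [
    4.1, 7.7, 17.5, 31.4, 32.7, 40.6, 92.4, 115.3, 118.3, 119.0,
    129.6, 198.6, 200.7, 242.5, 255.0, 274.7, 274.7, 302.8, 334.1,
    430.0, 489.1, 703.4, 978.0, 1656.0, 1697.8, 2745.6,
]
T_CRITICAL = 3.029  # T.01[26], taken from the slide


def grubbs_statistic(values):
    """Tn = (Yn - Ybar) / S, with the largest value as the suspect point."""
    return (max(values) - mean(values)) / stdev(values)


t_n = grubbs_statistic(SEEDED_RAINFALL)
print(f"T{len(SEEDED_RAINFALL)} = {t_n:.3f}, critical value = {T_CRITICAL}")
if t_n > T_CRITICAL:
    print("Largest value flagged as possible contamination")
```

The sketch tests only the largest observation; a suspiciously small value would be tested with (Ybar - Y1)/S instead.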
But Grubbs’ test is sensitive to non-normality……
Checking Assumptions on Rainfall Data
[Figure: histograms and normal plots (against quantiles of standard normal) of rainfall and of log10(rainfall). Annotations: "Contaminated" and "Normally distributed".]
Edwards 2000
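To see how much the normality assumption matters, the same statistic can be recomputed after a log10 transformation of the rainfall values. This is only a sketch: it reuses the critical value 3.029 quoted for n = 26, and the variable names are invented. It prints the statistic for the largest value on both scales so they can be compared.

```python
from math import log10
from statistics import mean, stdev

# Rainfall (acre-feet) from seeded clouds (Simpson et al. 1975).
SEEDED_RAINFALL = [
    4.1, 7.7, 17.5, 31.4, 32.7, 40.6, 92.4, 115.3, 118.3, 119.0,
    129.6, 198.6, 200.7, 242.5, 255.0, 274.7, 274.7, 302.8, 334.1,
    430.0, 489.1, 703.4, 978.0, 1656.0, 1697.8, 2745.6,
]
T_CRITICAL = 3.029  # T.01[26], as quoted in the Grubbs' test example


def grubbs_max(values):
    """Grubbs' statistic for the largest observation."""
    return (max(values) - mean(values)) / stdev(values)


t_raw = grubbs_max(SEEDED_RAINFALL)
t_log = grubbs_max([log10(v) for v in SEEDED_RAINFALL])

for label, t in (("rainfall", t_raw), ("log10(rainfall)", t_log)):
    verdict = "exceeds" if t > T_CRITICAL else "does not exceed"
    print(f"{label}: T = {t:.3f} {verdict} {T_CRITICAL}")
```

If the flag disappears on the scale where the data look normal, the "outlier" is more likely a consequence of skewness than of contamination.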
Simple Linear Regression:
check for model-based……


Outliers
Influential (leverage) points
Influential points in simple linear
regression:

A “leverage” point is a point with an
unusual regressor value that has more
weight in determining regression
coefficients than the other data values.
Edwards 2000
Influential Data Points in a Simple Linear Regression
[Figure, four slides: y plotted against x, with points labeled "leverage pt.", "outlier", and "outlier and leverage pt.".]
Edwards 2000
Brain weight vs. body weight, 63 species of terrestrial
mammals
[Figure: brain weight plotted against body weight (kg). Labeled points: African Elephant, Asian Elephant, Man. Annotations: "Leverage pts." and "Outliers".]
Edwards 2000
Logged brain weight vs. logged body weight
[Figure: log10 brain weight (g) plotted against log10 body weight. Labeled points: Elephant, Man, Chinchilla. Annotations: "Outliers" and "mispunched".]
Edwards 2000
Outliers in simple linear
regression
[Figure: Phosphatase plotted against Beta-Glucosidase, with Observation 62 labeled.]
Outliers: identify using
studentized residuals

Contamination may exist if
$|r_i| > t_{\alpha/2,\, n-3}$, with $\alpha = 0.01$
Simple linear regression:
Outlier identification
n = 86
$t_{\alpha/2,\, 83} = 1.98$
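A sketch of the studentized-residual check for a simple linear regression, in plain Python. The n - 3 degrees of freedom on the slide suggest externally studentized (deleted) residuals, so that form is assumed here; the function name and the test data are invented. The critical t value is not computed: it would be looked up for the chosen alpha and n - 3 degrees of freedom and compared with each |r_i|.

```python
from math import sqrt


def studentized_residuals(x, y):
    """Externally studentized residuals for the fit y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    s2 = sum(e ** 2 for e in resid) / (n - 2)  # residual mean square
    result = []
    for e, h in zip(resid, leverage):
        internal = e / sqrt(s2 * (1 - h))
        # Convert to the externally studentized form, which follows a
        # t distribution with n - 3 degrees of freedom under the model.
        result.append(internal * sqrt((n - 3) / (n - 2 - internal ** 2)))
    return result


if __name__ == "__main__":
    # Hypothetical data: a straight line with small noise and one bad point.
    x = list(range(10))
    y = [2 * xi + 1 + (0.1 if xi % 2 else -0.1) for xi in x]
    y[5] += 10
    for i, r in enumerate(studentized_residuals(x, y)):
        print(f"observation {i}: r = {r:.2f}")
```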
Simple linear regression:
detecting leverage points
$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{(n-1)\, S_x^2}$
A point is a leverage point if $h_i > 4/n$,
where n is the number of points used to fit
the regression
Regression with leverage point:
Soil nitrate vs. soil moisture
Regression without leverage point
Observation 46
Output from SAS:
Leverage points
n = 336
$h_i$ cutoff value: 4/336 = 0.012
The Key to QA/QC:
Make sure all personnel collecting,
entering, and analyzing data have a stake
in data quality!
Questions?