QA_QC Training Module.Oct2003.final

Quality Assurance & Quality Control University of New Mexico     Kristin Vanderbilt William Michener James Brunt Troy Maddux University of South Carolina  Don Edwards Structure of talk:    Types of errors Define QA/QC Procedures for reducing and detecting data contamination     Designing data sheets Validation rules, filters, lookup tables, control charts Graphics and statistics Outlier detection   Samples Simple linear regression Instruction Materials  Primary References  Michener and Brunt (2000) Ecological Data: Design, Management and Processing. Blackwell Science.     Edwards (2000) Ch. 4 Brunt (2000) Ch. 2 Michener (2000) Ch. 7 Dux, J.P. 1986. Handbook of Quality Assurance for the Analytical Chemistry Laboratory. Van Nostrand Reinhold Company QA/QC  “mechanisms [that] are designed to prevent the introduction of errors into a data set, a process known as data contamination” Brunt 2000 Errors (2 types)   Commission Omission Brunt 2000 Errors of Commission Incorrect or inaccurate data in a dataset Common Can be (relatively) easy to find     Malfunctioning instrumentation      Sensor drift Low batteries Damage Animal mischief Data entry errors Identifying Sensor Errors: Comparison of data from three Met stations, Sevilleta LTER Identification of Sensor Errors: Comparison of data from three Met stations, Sevilleta LTER Errors of Omission Common Difficult or impossible to find    Inadequate documentation of data values, sampling methods, anomalies in field, human errors Quality Control  “mechanisms that are applied in advance, with a priori knowledge to ‘control’ data quality during the data acquisition process” Brunt 2000 Quality Assurance  “mechanisms [that] can be applied after the data have been collected, entered in a computer and analyzed to identify errors of omission and commission”   graphics statistics Brunt 2000 QA/QC Activities    Defining and enforcing standards for formats, codes, measurement units and metadata. Checking for unusual or unreasonable patterns in data. Checking for comparability of values between data sets. Brunt 2000 Process from raw data to verified data to validated data (reviewed by a qualified scientist) is called data maturity, and implies increasing confidence in the quality of the data through time QA/QC should be considered at all stages of the research process…… •Study Design •Acquisition/Data Entry •Data documentation Flowering Plant Phenology – QA/QC and study design example     What data are we collecting? How often are we collecting? Where are we collecting? What is the experimental design? Protocols are developed:    Description of the project design (where, how, when data are collected) Instructions for collecting the data Sample data sheet The plant phenology study has:    Four Sites Three sites have three transects, one site has two transects Every species encountered will have phenological stage recorded Data Collection Form Development: What’s wrong with this data sheet? Plant ______________ ______________ ______________ ______________ ______________ ______________ ______________ ______________ Life Stage _______________ _______________ _______________ _______________ _______________ _______________ _______________ _______________ PHENOLOGY DATA SHEET Collectors:_________________________________ Date:___________________ Time:_________ Location: black butte, deep well, five points, goat draw Notes: _________________________________________ _______________________________________________ Plant ______________ ______________ ______________ ______________ ______________ ______________ ______________ ______________ Life Stage _______________ _______________ _______________ _______________ _______________ _______________ _______________ _______________ PHENOLOGY DATA SHEET Collectors:_________________________________ Date:___________________ Time:_________ Location: black butte, deep well, five points, rio salado Notes: _________________________________________ _______________________________________________ Plant ardi arpu atca bamu zigr P/G P/G P/G P/G P/G P/G P/G V V V V V V V Life Stage B B B B B B B FL FL FL FL FL FL FL FR FR FR FR FR FR FR M M M M M M M S S S S S S S D D D D D D D NP NP NP NP NP NP NP PHENOLOGY DATA SHEET Collectors:_________________________________ Date:___________________ Time:_________ Location: black butte, deep well, five points, goat draw Notes: _________________________________________ _______________________________________________ Plant ardi arpu atca bamu zigr P/G = perennating or germinating V = vegetating B = budding FL = flowering FR = fruiting P/G P/G P/G P/G P/G P/G P/G V V V V V V V Life Stage B B B B B B B FL FL FL FL FL FL FL FR FR FR FR FR FR FR M M M M M M M S S S S S S S M = dispersing S = senescing D = dead NP = not present D D D D D D D NP NP NP NP NP NP NP PHENOLOGY DATA SHEET Rio Salado - Transect 1 Collectors:_________________________________ Date:___________________ Time:_________ Location: black butte, deep well, five points, goat draw Notes: _________________________________________ _______________________________________________ Plant ardi arpu atca bamu zigr P/G = perennating or germinating V = vegetating B = budding FL = flowering FR = fruiting P/G P/G P/G P/G P/G P/G P/G V V V V V V V Life Stage B B B B B B B FL FL FL FL FL FL FL FR FR FR FR FR FR FR M M M M M M M S S S S S S S M = dispersing S = senescing D = dead NP = not present D D D D D D D NP NP NP NP NP NP NP QA/QC and Data Acquisition: Train technicians well!! Data entry:  Use data entry programs that have   Validation rules or constraints on the range or values that can be entered Lookup Tables Other methods for preventing data contamination   Double-keying of data by independent data entry technicians followed by computer verification for agreement Filters for “illegal” data  Computer/database programs   Legal range of values Sanity checks Edwards 2000 Flow of Information in Filtering “Illegal” Data Raw Data File Table of Possible Values and Ranges Illegal Data Filter Report of Probable Errors Edwards 2000 Illegal-data filter in SAS Data Checkum; Set all; Message=repeat(“ ”,39); If Y1<0 or Y1>1 then do; message=“Y1 is not on the interval [0,1]”; output; end; If Y3>Y4 then do; message=“Y3 is larger than Y4”; output; end; …….(add as many checks as desired) If message NE repeat(“ ”, 39); Keep ID message; Proc Print Data=Checkum; QC in the Lab Laboratory quality control using statistical process control charts:  Determine whether analytical system is “in control” by examining  Mean  Variability (range) All control charts have three basic components:    a centerline, usually the mathematical average of all the samples plotted. upper and lower statistical control limits that define the constraints of common cause variations. performance data plotted over time. X-Bar Control Chart UCL 26 UWL 24 Mean 22 20 LWL LCL 18 16 0 2 4 6 8 10 Time 12 14 16 18 20 Constructing an X-Bar control chart    Each point represents a check standard run with each group of 20 samples, for example UWL = mean + 2* standard deviation UCL = mean + 3*standard deviation Things to look for in a control chart: The point of making control charts is to look at variation, seeking patterns or statistically unusual values. Look for:     1 data point falling outside the control limits 6 or more points in a row steadily increasing or decreasing 8 or more points in a row on one side of the centerline 14 or more points alternating up and down Linear trend 12 10 8 6 4 2 0 0 2 4 6 time 8 10 12 Range control charts – use when check standards are unavailable.  To construct an R chart:   Run 15-20 samples in duplicate, and calculate absolute value of range between duplicate readings, and calculate mean of ranges (R) 50% of ranges should be above the line corresponding to 0.845R, 95% above 2.456 R, and 3.27R corresponds to 99%. Range Control Chart 16 14 12 10 8 6 4 2 0 0 2 4 6 8 10 time 12 14 16 18 QA/QC should be considered at all stages of the research process…… •Study Design •Acquisition/Data Entry •Data documentation QA/QC and metadata     Check data table attribute names—are they reasonable? Are variables defined? Are missing values coded? Are metadata complete? Quality Assurance:  Graphical and statistical methods can be used to detect:   Outliers Influential points in regression analysis Outliers   An outlier is “an unusually extreme value for a variable, given the statistical model in use” The goal of QA is NOT to eliminate outliers! Rather, we wish to detect unusually extreme values. Edwards 2000 Outlier Detection  “the detection of outliers is an intermediate step in the elimination of [data] contamination”   Attempt to determine if contamination is responsible and, if so, flag the contaminated value. If not, formally analyse with and without “outlier(s)” and see if results differ. Edwards 2000 Methods for Detecting Outliers  Graphics      Scatter plots Box plots Histograms Normal probability plots Formal statistical methods   Kolmogorov-Smirnov test of normality Grubbs’ test Edwards 2000 X-Y scatter plots of gopher tortoise morphometrics Michener 2000 Example of exploratory data analysis using SAS Insight® Michener 2000 Box Plot Interpretation Pts. > upper adjacent value Upper adjacent value Interquartile range Upper quartile Median Lower quartile Lower adjacent value Pts. < lower adjacent value Box Plot Interpretation IQR = Q(75%) – Q(25%) Interquartile range Upper adjacent value = largest observation < (Q(75%) + (1.5 X IQR)) Lower adjacent value = smallest observation > (Q(25%) - (1.5 X IQR)) Box Plots Depicting Statistical Distribution of Soil Temperature Statistical tests for outliers assume that the data are normally distributed…. CHECK THIS ASSUMPTION! Normal density and Cumulative Distribution Functions 0. 0.2 0.4 0.6 0.8 1.0 0. 0.1 0.2 0.3 0.4 m u a ll a F ti v re eq N u o e r n m cy a -3 -2 -1 0 1 2 3 Y -3 -2 -1 0 1 2 3 Y Edwards 2000 Y 7 8 9 10 1 12 Normal Plot of 30 Observations from a Normal Distribution es -2 -10 of 1 2 Edwards 2000 S tandard Normal Plots from Non-normally Distributed Data - 2 1 0 1 2 Y 0. 1.0 2.0 3.0 Y 01234 e s ft k -t e ru w e n d ca d ti e s d t ri N b u o t rm i o -2 1 0 1 2 Q of uan S t andar t iles of d S N t andar or m a - 2 1 0 1 2 Y 2 4 6 8 10 12 14 16 Y-5 0 5 v o y n -t ta a i m l e d i n a d t i s e t d ri b N u o t rm i o -2 1 0 1 2 Q of uan S t andar t iles of d S N t andar or m a Edwards 2000 Grubbs’ Text is a Statistical Test to determine if a point is an outlier. Grubbs’ test for outlier detection: Tn = (Yn – Ybar)/S where Yn is the possible outlier, Ybar is the mean of the sample, and S is the standard deviation of the sample Contamination exists if Tn is greater than T.01[n] Example of Grubbs’ test for outliers: rainfall in acre-feet from seeded clouds (Simpson et al. 1975) T26 4.1 32.7 118.3 200.7 274.7 489.1 1656.0 = 3.539 > 7.7 17.5 31.4 40.6 92.4 115.3 119.0 129.6 198.6 242.5 255.0 274.7 302.8 334.1 430.0 703.4 978.0 1697.8 2745.6 3.029; Contaminated Edwards 2000 But Grubbs’ test is sensitive to non-normality…… Checking Assumptions on Rainfall Data Contaminated 500 150 0 250 00 Quan r a in t i f l a es ll -2 -1 0 1 2 of S t andar p l l o g 1 0 ( r a i n f ) 0.5 1.5 2.5 3.5 l o g 1 0 (ra d .i n N fa o l rm l ) a l 0 2 4 6 8 10 p l r a i n f l 0 50 150 250 ra i n b f . a N l l o rma l 0 5 10 15 20 a . Normally distributed 0. 1. 5 2. 5 3. 55 -2 -1 0 1 2 og10 Quan (r t i ai les nf a of ll) S t andar Edwards 2000 Simple Linear Regression: check for model-based……   Outliers Influential (leverage) points Influential points in simple linear regression:  A “leverage” point is a point with an unusual regressor value that has more weight in determining regression coefficients than the other data values. Edmonds 2000 Influential Data Points in a Simple Linear Regression p y 50 10 150 l ev erage outl i er and l ev erage p outl i er 0 10 20 30 40 50 Edwards 2000 x Influential Data Points in a Simple Linear Regression y 50 10 150 l ev erage p outl i er an l ev erage p outl i er 0 10 20 30 40 50 x Edwards 2000 Influential Data Points in a Simple Linear Regression y 50 10 150 l ev erage p outl i er an l ev erage p outl i er 0 10 20 30 40 50 x Edwards 2000 Influential Data Points in a Simple Linear Regression y 50 10 150 l ev erage p outl i er an l ev erage p outl i er 0 10 20 30 40 50 x Edwards 2000 Brain weight vs. body weight, 63 species of terrestrial mammals Leverage pts. b r a i n w e g h t ( ) 0 10 20 30 40 50 A fri c an s i an E l eph E l ephant “Outliers” M an 0 2000 4000 6000 body Edwards 2000 wei ght ( k g Logged brain weight vs. logged body weight l o g 1 0 b r a i n w e h t ( g ) -1 0 1 2 3 E l ephant M an Chi nc hi l l a “Outliers” m i s punc hed 10 p -2 -1 0 1 2 3 4 body Edwards 2000 wei gh Outliers in simple linear regression Observation 62 250 Phosphatase 200 150 100 50 0 0 50 100 150 200 250 Beta-Glucosidase 300 350 400 450 Outliers: identify using studentized residuals  Contamination may exist if |ri| > t  = 0.01 /2, n-3 Simple linear regression: Outlier identification n = 86 t/2,83 = 1.98 Simple linear regression: detecting leverage points hi = (1/n) + (xi – x)2/(n-1)Sx2 A point is a leverage point if hi > 4/n, where n is the number of points used to fit the regression Regression with leverage point: Soil nitrate vs. soil moisture Regression without leverage point Observation 46 Output from SAS: Leverage points n = 336 hi cutoff value: 4/336=0.012 The Key to QA/QC: Make sure all personnel collecting, entering, and analyzing data have a stake in data quality! Questions?

QA_QC Training Module.Oct2003.final

Related documents

Products

Support

QA_QC Training Module.Oct2003.final

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib