CC image by tnarik on Flickr

Lesson 4: Data Quality Control and Assurance

• Definitions
  o Quality assurance and quality control
  o Data contamination
  o Types of errors
• QA/QC best practices
  o Before data collection
  o During data collection/entry
  o After data collection/entry

• After completing this lesson, the participant will be able to:
  o Define data quality control and data quality assurance
  o Perform quality control and assurance on their data at all stages of the research cycle

Data life cycle: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze

CC image by Michael Coghlan on Flickr

Data Contamination
• A process or phenomenon, other than the one of interest, that affects the variable value
• Erroneous values

• Errors of Commission
  o Incorrect or inaccurate data entered
  o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
  o Data or metadata not recorded
  o Examples: inadequate documentation, human error, anomalies in the field

• Strategies for preventing errors from entering a dataset
• Quality assurance
  o Activities to ensure quality of data before collection
• Quality control
  o Monitoring and maintaining the quality of data during the study

• Define & enforce standards
  o Formats
  o Codes
  o Measurement units
  o Metadata
• Assign responsibility for data quality
  o Be sure the assigned person is educated in QA/QC

• Double entry
  o Data keyed in by two independent people
  o Check for agreement with computer verification
• Use a text-to-speech program to read data back
• When data are being re-used, incorporate them electronically whenever possible
• Design data storage well
  o Minimize the number of times items must be entered repeatedly
  o Use consistent terminology
  o Atomize data: one cell per piece of information
• Document changes to data
  o Avoids duplicate error checking
  o Allows undo if necessary

• Make sure data line up in the proper columns
• Check that there are no missing, impossible, or anomalous values
• Perform statistical summaries

• Use an illegal data filter
  o Illegal data: values that are impossible or unlikely for a given variable
  o Generate the filter using software such as Excel, R, MATLAB, or SAS

Example of Illegal Data Filter
Table 1 from Edwards (2000). An illegal-data filter, written in SAS (the data set "All" exists prior to this DATA step, containing the data to be filtered, variable names Y1, Y2, etc., and an observation identifier variable ID).

    Data Checkum; Set All;
      message=repeat(" ",39);
      If Y1<0 or Y1>1 then do; message="Y1 is not on the interval [0,1]"; output; end;
      If Floor(Y2) NE Y2 then do; message="Y2 is not an integer"; output; end;
      If Y3>Y4 then do; message="Y3 is larger than Y4"; output; end;
      /* ... add as many such statements as desired ... */
      If message NE repeat(" ",39);
      keep ID message;
    Run;
    Proc Print Data=Checkum;
    Run;

• Look for outliers
  o Outliers are extreme values for a variable given the statistical model being used
  o The goal is not to eliminate outliers but to identify potential data contamination

[Figure: example plot; y-axis from 0 to 60, x-axis from 0 to 35]

• Methods to look for outliers
  o Graphical
    • Normal probability plots
    • Regression
    • Scatter plots
  o Maps
  o Mean, median, and standard deviation
• Be sure to transform data when looking for outliers graphically

• Data contamination is data that results from a factor not examined by the study affecting data values
• Data errors are of two types: commission or omission
• Quality assurance and quality control are strategies for
  o preventing errors from entering a dataset
  o ensuring data quality for entered data
  o monitoring and maintaining data quality throughout the project
• Identify and enforce quality assurance and quality control measures throughout the Data Life Cycle

1. D. Edwards, in Ecological Data: Design, Management and Processing, W. K. Michener and J. W. Brunt, Eds. (Blackwell, New York, 2000), pp. 70-91. Available at www.ecoinformatics.org/pubs
2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for preparing ecological data sets to share and archive. Bull. Ecol. Soc. Amer. 82, 138-141 (2001).
3. A. D. Chapman, "Principles of Data Quality. Report for the Global Biodiversity Information Facility" (Global Biodiversity Information Facility, Copenhagen, 2004). Available at http://www.gbif.org/communications/resources/print-and-onlineresources/download-publications/bookelets/
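
Supplementary sketch: the lesson notes that an illegal-data filter can be built in tools other than SAS, and that the mean and standard deviation can help locate outliers. The Python below is one possible way to express both ideas, not the method from Edwards (2000). It reuses the three rules and the variable names (ID, Y1 to Y4) from the SAS example; the function names, the sample rows, and the threshold value are hypothetical and chosen only for illustration.

```python
import math
import statistics


def illegal_data_filter(rows):
    """Return (ID, message) pairs, one for every rule that a row violates.

    The three rules mirror the SAS example in this lesson; add more as needed.
    """
    problems = []
    for row in rows:
        if not 0 <= row["Y1"] <= 1:
            problems.append((row["ID"], "Y1 is not on the interval [0,1]"))
        if math.floor(row["Y2"]) != row["Y2"]:
            problems.append((row["ID"], "Y2 is not an integer"))
        if row["Y3"] > row["Y4"]:
            problems.append((row["ID"], "Y3 is larger than Y4"))
    return problems


def flag_outliers(values, threshold=3.0):
    """Return the indices of values lying more than `threshold` sample
    standard deviations from the mean.

    Flagged values are candidates for review, not for automatic deletion.
    The default threshold is an assumption, not a recommendation.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # all values identical: nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


# Hypothetical sample data, for illustration only.
sample_rows = [
    {"ID": 1, "Y1": 0.5, "Y2": 3, "Y3": 1.0, "Y4": 2.0},
    {"ID": 2, "Y1": 1.2, "Y2": 4, "Y3": 1.0, "Y4": 2.0},
    {"ID": 3, "Y1": 0.1, "Y2": 2.5, "Y3": 5.0, "Y4": 2.0},
]

if __name__ == "__main__":
    for obs_id, message in illegal_data_filter(sample_rows):
        print(obs_id, message)
    print(flag_outliers([10, 11, 9, 10, 12, 10, 11, 9, 10, 60], threshold=2.0))
```

Unlike the SAS step, which writes problem records to a new data set, this sketch returns a list so the flagged IDs can be reviewed and the resulting changes documented. With small samples a single extreme value inflates the standard deviation, so a lower threshold (2.0 in the usage line) or a graphical check may be needed.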