Simple Data Cleaning Tools and Methodologies

From Data to Uncertainty Principles of Data Quality Albatrosses, Kaikoura, New Zealand Arthur D. Chapman Australian Biodiversity Information Services Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 The Data Equation Oceans of Data Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Praia de Forte, Brazil The Data Equation Rivers of Information Ocean Biodiversity Informatics, Hamburg Doubtful Sound, New Zealand 29 Nov 2004 The Data Equation Streams of Knowledg e Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Wasatch, Utah, USA The Data Equation Drops of Understand ing Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 (Nix 1984) Taking Data to Information Species Data Species Data Crab Florianopolis, BrazilRock Cormorants Wandering Argentina Orca, San Albatros, NZ Francisco Armeria maritima Brown Algae, Argentina Algae, New Argentina zealand Corals, Australia Environmental Data Information Temp Rain Range Rain June Jan Decisions Policy Conservation Management GIS Data Models Decision Support Information Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 ThedoNeed fortoModelling Why we need use models? Cont. USA Brazil Australia Population: Collections: Plants: Reptiles: Mammals: Insects: Population: 172 million Collections: 50 million Plants: 70,000 Reptiles: 470-650 Mammals: 394 Insects: ?1.7 million Population: 20 million Collections: 35 million Plants: 20,000 Reptiles: 850-890 Mammals: 305 Insects: ?220,000 292 million 0.5-1 billion 18,000 350 428 ?150,000 Oceans Population : ~ 0 Collections: ?10 million Plants: ?? Vertebrates: ?? Invertebrates: ?? From OBIS 2004 Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 What do we mean by ‘Data Quality’? An essential or distinguishing characteristic necessary for [spatial] data to be fit for use. SDTS 02/92 The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data. Chrisman, 1991 Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data quality - fitness for use? Fitness for use – Does species ‘A’ occur in Tasmania? – Does species ‘A’ occur in National Park ‘y’ Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 The Biological Data Domains Errors can occur in any one of these Plant and animal specimen data held in museums provide a vast information resource, providing not only present day information on the locations of these entities, but also historic information going back several hundred years (Chapman and Busby 1994). Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Loss of data quality Loss of data quality can occur at many stages: • At the time of collection • During digitisation • During documentation • During storage and archiving • During analysis and manipulation • At time of presentation • And through the use to which they are put Don’t underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor. (Redman 2001). Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Principles of data quality The Vision: • It is important for organizations to have a vision with respect to having good quality data. • As well as a vision, an organization needs a policy to implement that vision. • And a strategy for implementation Experience has shown that treating data as a long-term asset and managing it within a coordinated framework produces considerable savings and ongoing value. (NLWRA 2003). Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 The data quality vision A Vision may involve • Not reinventing information management wheels • Looking for efficiencies in data collection and quality control procedures • Sharing of data, information and tools • Using existing standards or developing new, robust standards • Fostering the development of networks and partnerships • Presenting a sound business case for data collection and management • Reducing duplication in data collection and data quality control • Looking beyond immediate use and examining requirements of users • Ensuring that good documentation and metadata procedures. Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Strategies Short term - Data that can be assembled and checked over a 6-12 month period Intermediate - Data that can be entered over about an 18-month period with small investment of resources - Data that can be checked using simple in-house methods Long Term - Data that can be entered and/or checked over a longer time frame, using collaborative arrangements Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Information management chain From: Chapman 2004 Assign responsibility for the quality of data to those who create them. If this is not possible, assign responsibility as close to data creation as possible (Redman 2001) Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles -1 • Planning is essential – develop a vision, a policy and strategy – Total Data Quality Management Cycle Data ownership and custodianship not only confers rights to manage and control access to data, it confers responsibilities 1 for its management, quality control and maintenance. Custodians also have a moral responsibility to superintend the data for Ocean use by future generations (Chapman 2004) 29 Nov 2004 Biodiversity Informatics, Hamburg Data Cleaning Principles - 2 • Organising Data improves efficiency – The organizing of data prior to data checking, validation and correction can improve efficiency and considerably reduce the time and costs of data cleaning. – For example, by sorting data on location, efficiency gains can be achieved through checking all records pertaining to the one location at the same time, rather than going back and forth to key references. – Similarly, by sorting records by collector and date, it is possible to spot errors where a record may be at an unlikely location for that collector on that day. Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles - 3 • Prevention is better than cure – It is far cheaper and more efficient to prevent an error from happening, than to have to detect it and correct it later. It is also important that when errors are detected, that feedback mechanisms ensure that the error doesn’t occur again during data entry, or that there is a much lower likelihood of it reoccurring. Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Asplenium bulbiferum, New Zealand Data Cleaning Principles - 4 • Responsibility belongs to everyone (collector, custodian and user) – The principle responsibility belongs to the data custodian – The collector has responsibility to respond to the custodian’s questions when the custodian finds errors or ambiguities that may refer back to the original information supplied by the collector. These may relate to ambiguities on the label, errors in the date or location, etc. – The user also has a key responsibility to feed back to custodians information on any errors or omissions they may come across, including errors in the documentation associated with the data. Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles - 5 • Partnerships improve efficiency – By developing partnerships, many data validation processes won’t need to be duplicated, errors will more likely be documented and corrected, and new errors won’t be incorporated by inadvertent “correction” of suspect records that are not in error. – Partnerships with: • Data collectors • Other institutions with duplicate collections • Like-minded institutions developing tools, standards and software • Key data brokers (e.g. OBIS, GBIF) • Data users (good feedback mechanisms) • Statisticians and data auditors Yours is not the only organization that is dealing with data quality. Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles - 6 • Prioritisation reduces duplication – Prioritisation helps reduce costs and improves efficiency. It is often of value to concentrate on those records where lots of data can be cleaned at the lowest cost. – For example, those that can be examined using batch processing or automated methods, before working on the more difficult records. – By concentrating on those data that are of most value to users, there is also a greater likelihood of errors being detected and corrected. Tierra del Fuego, Argentina Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Prioritising data quality procedures • Focus on most critical data first • Concentrate on discrete units (taxonomic, geographic, etc.) • Ignore data that are not used or for which data quality cannot be guaranteed • Consider data that are of broadest value, are of greatest benefit to the majority of users and are of value to the most diverse of uses • Work on those areas whereby lots of data can be cleaned at the lowest cost (e.g. through use of batch processing). Not all data are created equal, so focus on the most important, and if data cleaning is required, make sure it never has to be repeated (Chapman 2004). Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles -7 • Set targets and performance measures – Performance measures are a valuable addition to quality control procedures, – They help an organization manage their data cleaning processes. – Performance measures may include statistical checks on the data (for example, 95% of all records are within 1,000 meters of their reported position), on the level of quality control (for example – 65% of all records have been checked by a qualified taxonomist within the previous 5 years; 90% have been checked by a qualified taxonomist within the previous 10 years). Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles - 8 • Minimise duplication and re-working of data – Duplication is a major factor with data cleaning in most organizations. – Many organizations add the geocode at the same time as they database the record. As records are seldom sorted geographically, this means that the same locations will be chased up a number of times. – By carrying out the georeferencing as a special operation, records from similar locations can then be sorted and then the appropriate map-sheet only has to be extracted once. – Some institutions also use the database itself to help reduce duplication by searching to see if the location may already have been georeferenced . Ocean Biodiversity Informatics, Hamburg Nothofagus antarctica, Argentina 29 Nov 2004 Data Cleaning Principles - 9 • Feedback is a two-way street – Users of the data will inevitably carry out error detection, and it is important that they feedback the results to the custodians. – It is essential that data custodians encourage feedback from users of their data, and take the feedback that they receive seriously. – Data custodians also need to feed back information on errors to the collectors and data suppliers where relevant. – In this way there is a much higher likelihood that the incidence of future errors will be reduced and the overall data quality improved. Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles - 10 • Education and training improves techniques – Poor training, especially at the data collection and data entry stages of the Information Quality Chain, is the cause of a large proportion of the errors in primary species data. – Good training of data entry operators can reduce the error associated with data entry considerably, reduce data entry costs and improve overall data quality. Brown Algae, Argentina Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles - 11 • Accountability, Transparency and Audit-ability are important – Haphazard and unplanned data cleaning exercises are very inefficient and generally unproductive. – Within data quality policies and strategies – clear lines of accountability for data cleaning need to be established. – To improve the “fitness for use” of the data and thus their quality, data cleaning processes need to be transparent and well documented with a good audit trail to reduce duplication and to ensure that once corrected, errors never re-occur. Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Data Cleaning Principles - 12 • Documentation is the key to good data quality – Without good documentation, it is difficult for users to determine the fitness for use of the data and difficult for custodians to know what and by whom data quality checks have been carried out. – Documentation is generally of two types. – The first is tied to each record and records what data checks have been done and what changes have been made and by whom. – The second is the metadata that records information at the dataset level. – Both are important, and without them, good data quality is compromised. Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Recording Accuracy and Error Additional Accuracy Fields –Preferably in meters (Point-Radius) Documenting Validation tests – Who – What – How Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Methods for geocode validation • • • • • Internal Database Checks External Database Checks Outliers in Geographic Space - GIS Outliers in Environmental Space - Models Statistical outliers Ocean Biodiversity Informatics, Hamburg Butterfly, Florida, USA 29 Nov 2004 Internal/External Database Checks • Logical inconsistencies within the database • Checking one field against another – Text location vs geocode or District/State • Checking one database against another – Gazetteers – DEM – Collectors Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Magellanic Penguin, Argentina Error Error is inescapable and it should be recognised as a fundamental dimension of data. Chrisman 1991 Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Geographic outliers - GIS Country, State, named district, etc. Gazetteer of Brazilian localities Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 35 1400 30 1200 Rainfall (mm) Temperature (C) Diva-GIS - Outlier 25 20 15 www.diva-gis.org 1000 800 600 400 • Reverse200jack-knifing technique • Threshold0 value t = 0.95(n) +0.2 10 5 0 t a n n t m n c m t m x w m t s p a n Ocean t c l q t w m q Biodiversity t w e t q Informatics, t d r y q Hamburg r w e t m r d r y m 29 Nov r c v a r 2004 r w e t q r d r y q r c l q r w m q CRIA-Data Cleaning http://splink.cria.org.br/dc/ Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Principal Components Analysis - FloraMap Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data. A. Principal Components Analysis B. Specimen record. C. Mapped specimen. Ocean Biodiversity Informatics, Hamburg D. Climate profile 29 Nov 2004 Cumulative Frequency Curves - DivaGiS Results from Diva-GIS showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The Blue lines represent the 97.5 percentile Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Environmental Outliers • Cumulative Frequency Curves Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Errors in data Although most data gathering disciplines treat error as an embarrassing issue to be expunged, the error inherent in (spatial) data deserves closer attention and public understanding. Chrisman, 1991 Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Errors in data - 2 In general, error must not be treated as a potentially embarrassing inconvenience, because error provides a critical component in judging fitness for use. Chrisman, 1991 Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Mizodendrum sp., Argentina Challengers FutureFuture Challengers • Improved data quality • Improved documentation of data • Improved access to distributed data • Improved methods for modelling in aquatic (including marine) environments • Decision Support Systems Enlightened Policy / Decision Makers!!! Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Thank You… Questions? Ocean Biodiversity Informatics, Hamburg 29 Nov 2004 Acknowledgements • • • • • • • • • • Brazilian Biota/FAPESP Virtual Biodiversity Institute Program Reference Centre for Environmental Information, Brazil (CRIA) Global Biodiversity Information Facility (GBIF) UNESCO Wesleyan University, Connecticut, USA Peabody Museum, Yale University, USA ETI, Holland UN Food and Agriculture Organization (FAO) Environmental Resources Information Network, Australia (ERIN) Commission on Data for Science and Technology (CODATA) Ocean Biodiversity Informatics, Hamburg 29 Nov 2004

Simple Data Cleaning Tools and Methodologies

Related documents

Products

Support

Simple Data Cleaning Tools and Methodologies

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib