Spatial data quality
Geog 458: Map Sources and Errors
February 10, 2006

Outline
• Why is spatial data quality an issue?
• Taxonomy of uncertainty
• Anatomy of error
• Definition of quality
• Assessing data quality
• Levels of testing
• Components of spatial data quality

Why is spatial data quality an issue? Examples
• Buying a land parcel
  – Discrepancy between the parcel area measured on the land parcel map and the area measured with a GPS receiver
• Land use change research
  – Classification accuracy of remotely sensed images and its effect on policy-making

Why is spatial data quality an issue?
• Spatial data are used in many decisions and analyses
• Increasing availability, exchange, and use of spatial data
  – Good news: more awareness of the importance of geographic information
  – Bad news: poor-quality data are increasingly available
• A growing number of users are less aware of spatial data quality
• Gap between data producers and data users
• GIS do not usually provide functionality for analyzing error propagation
• Spatial data quality has implications for decision-making
  – People do not fully appreciate the consequences of poor-quality data, whereas they are easily convinced by pretty maps
• To resolve the issues illustrated in the examples,
  – Uncertainty should be defined precisely
    • Taxonomy of uncertainty
    • How to quantify error?
  – The effect of input data on analysis should be demonstrated
    • How to validate error?

Taxonomy of uncertainty
• Real world to spatial data
• The discrepancy between the real world and spatial data is unavoidable
• The transformation (from real world to data) can be partitioned into two steps, and errors are inherent in each step
  – Interpretation (or conceptualization)
    • Vagueness: arises from poor definition
    • Ambiguity: arises from disagreement
  – Measurement
    • Error: discrepancy between an observed value and the true value; can be measured only when there is a clear definition of what constitutes the truth

Anatomy of error
• Error is the discrepancy between an observed value and the true value
  [Diagram: a measurement from the test data set, the mean of the measurements, and the true value from the reference data set; systematic error (bias) separates the mean from the true value, random error separates each measurement from the mean, and together they make up the total error]
• Accuracy is calculated from total error: the closeness of an observation to the true value
• Precision is calculated from random error (see Figure 6.11 at p. 140; a numerical sketch follows these slides)

Anatomy of error
• Precision
  – Statistical: variability among repeated measurements
  – Storage: amount of detail that can be discerned
• Resolution
  – Detail in which the data are presented
  – Minimum distance that can be recorded
• Scale
  – Map scale: ratio of map distance to ground distance
• Discuss the relation of these terms to accuracy

Anatomy of error
• Statistical precision (deviation from the sample mean) is called relative accuracy
• Storage precision sets a lower bound on resolution
• Resolution sets a lower bound on accuracy
• Geographic scale and resolution are separate concepts, especially in digital maps, even though they are historically related
  – A large-scale hardcopy map has high resolution
  – A small-scale hardcopy map has low resolution
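To make the distinction between systematic error, random error, and total error concrete, here is a minimal Python sketch (not part of the original lecture). It assumes a hypothetical set of repeated GPS readings of one coordinate whose "true" value comes from a reference data set; all numbers are invented for illustration.

```python
"""Illustrative sketch: decomposing measurement error (hypothetical values)."""
import statistics

true_value = 500000.00                      # reference ("true") easting, metres
readings = [500001.8, 500002.3, 500001.5,   # hypothetical repeated measurements
            500002.9, 500002.1]

mean_reading = statistics.mean(readings)

# Systematic error (bias): mean of the measurements minus the true value
bias = mean_reading - true_value

# Random error: spread of the measurements about their own mean
# (this is the statistical sense of precision)
precision = statistics.stdev(readings)

# Total error of each observation: observed value minus true value;
# accuracy is often summarized as the root mean square of these errors (RMSE)
rmse = (sum((r - true_value) ** 2 for r in readings) / len(readings)) ** 0.5

print(f"bias (systematic error): {bias:.2f} m")
print(f"precision (std. dev. of random error): {precision:.2f} m")
print(f"RMSE (total error vs. reference): {rmse:.2f} m")
```

A data set can show a small spread (high precision) yet a large bias, which is why relative accuracy alone does not guarantee closeness to the true value.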
Definition of quality
• See the reading C4 (Chrisman 1984)
  – Degree of excellence
  – Meeting an expectation
  – Fitness for use
  – Conformance to a standard
• "Totality of characteristics of a product that bear on its ability to satisfy stated and implied needs" (ISO 19113)

Who assesses data quality?
• Minimum quality standards
  – A data product should pass the quality standard
  – Data producers are responsible for assessing data quality
  – May be too inflexible
• Metadata standards
  – Do not impose a quality standard, under the belief that errors are inherent
  – Data producers simply provide documentation (i.e. truth in labeling)
  – Data users are responsible for determining the fitness-for-use of the data
• Market standards
  – Use a two-way information flow between data producers and data users
• Data may not be perfectly accurate, but they can still be useful for given applications and purposes; in other words, they have a quality (fitness-for-use). To determine data quality, errors should be properly documented (→ data quality)

Levels of error assessment
• Lineage report (descriptive)
  – Most primitive level of error assessment, but essential to understanding the characteristics of the data
  – Provides data sources and processing steps
  – Algorithms used for mathematical transformations, geographic scale of source data, currentness of data
• Deductive estimates (descriptive)
  – Guess (extrapolation) based on sample testing; calibration test
  – Should be as numeric as possible
• Internal evidence (quantitative measure)
  – Provides the result of error propagation analysis
  – What is the impact of parameters in the input data on the output product or analysis results? (e.g. the modifiable areal unit problem)
• External source (quantitative measure)
  – If a "true value" is believed to exist,
  – Report on the discrepancy between observed values and true values
• Read C4 p. 45 (Levels of Testing) or C3 (SDTS data quality section)

Matrix of spatial data quality
• Rows: components of data quality (accuracy, consistency, completeness)
• Columns: components of geographic information (space, time, theme)
• CSDGM elements in this matrix:
  – Accuracy: positional accuracy (space), attribute accuracy (theme)
  – Consistency: logical consistency
  – Completeness: completeness

Components of spatial data quality: Accuracy
• Accuracy is the inverse of error
• Many people equate accuracy with quality, but in fact accuracy is only one component of quality
• An error is a discrepancy between the observed value and the true value
• What if a true value does not exist?
• Relative accuracy may suffice in some cases
  – Land parcel area
• Absolute accuracy may be required in some cases
  – Exact geodetic coordinate values
• Read the SDTS data quality section (reading C3)

Components of spatial data quality: Accuracy
• Can be divided into spatial, temporal, and thematic accuracy
• Spatial accuracy and thematic accuracy are recognized in CSDGM as positional accuracy and attribute accuracy
• If data are measured on a quantitative scale, RMSE can be used to report error estimates
• If data are measured on a qualitative scale, a misclassification matrix can be used to report error estimates (a sketch of both follows this slide)
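As a rough illustration of the two reporting forms just mentioned, the Python sketch below (not part of the original lecture) computes an RMSE for a quantitative attribute and a misclassification (confusion) matrix for a qualitative land-cover attribute. The elevation values and class labels are hypothetical.

```python
"""Illustrative sketch: RMSE and a misclassification matrix (hypothetical data)."""
from collections import Counter

# --- Quantitative attribute: stored elevations vs. independently surveyed values ---
observed  = [102.1, 98.7, 150.3, 120.9]   # values in the database (m)
reference = [101.5, 99.0, 151.0, 120.0]   # reference ("true") values (m)

rmse = (sum((o - r) ** 2 for o, r in zip(observed, reference)) / len(observed)) ** 0.5
print(f"RMSE of elevation: {rmse:.2f} m")

# --- Qualitative attribute: classified land cover vs. reference land cover ---
classified = ["forest", "forest", "urban", "water", "urban", "forest"]
truth      = ["forest", "urban",  "urban", "water", "forest", "forest"]

# Misclassification matrix: counts of (reference class, classified class) pairs
matrix = Counter(zip(truth, classified))
classes = sorted(set(truth) | set(classified))

print("reference \\ classified:", classes)
for t in classes:
    print(t, [matrix[(t, c)] for c in classes])

# Overall accuracy: proportion of samples on the matrix diagonal
overall = sum(matrix[(c, c)] for c in classes) / len(truth)
print(f"overall classification accuracy: {overall:.2f}")
```

The off-diagonal cells of the matrix show which classes are being confused with which, which is more informative than the single overall accuracy figure.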
Components of spatial data quality: Consistency
• Absence of apparent contradictions in a database
• The fidelity of relationships encoded in the data structure
• Internal validity of a database
• Spatial consistency includes conformance to topological rules
• Temporal consistency is related to temporal topology (whether an event can occur at a given location at a given time)
• Thematic consistency refers to a lack of contradiction in attributes (e.g. density = population / area, so the three attributes should agree in every record; e.g. Erie County is not part of Washington State)

Components of spatial data quality: Completeness
• A lack of errors of omission in a database
• The relationship between the objects represented and the abstract universe of all such objects
• To define the abstract universe, we need a specification (e.g. what constitutes a hospital, housing, or forest?)
  [Diagram: data ↔ specification ↔ abstract universe]
• Two kinds of completeness
  – Data completeness: data relative to the specification
  – Model completeness: specification relative to the abstract universe
  – Even highly generalized data can be data complete if they contain all of the objects described in the specification
  – A data set is model complete if its specification is appropriate for a given application

Components of spatial data quality: Completeness
• Within data completeness,
  – Errors resulting in overcompleteness are called errors of commission
  – Errors resulting in incompleteness are called errors of omission (see the sketch after this slide)
• Can be divided into spatial, temporal, and thematic completeness
• For example, a database of buildings in Washington State as of January 2003
  – Spatial incompleteness: the data contain only one building
  – Temporal incompleteness: the data include only buildings built by 2000
  – Thematic incompleteness: the data contain only residential buildings
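As a small sketch of checking data completeness against a specification, the following Python snippet (not from the lecture) compares a hypothetical building database with the list of buildings the specification says should be present, and counts errors of omission and commission. The identifiers are invented.

```python
"""Illustrative sketch: data completeness as omission and commission (hypothetical IDs)."""

# Buildings the specification says should be in the database
specified = {"B001", "B002", "B003", "B004", "B005"}

# Buildings actually present in the database
database = {"B001", "B002", "B004", "B006"}

omission   = specified - database   # specified but missing  -> incompleteness
commission = database - specified   # present but not specified -> overcompleteness

print(f"errors of omission   ({len(omission)}): {sorted(omission)}")
print(f"errors of commission ({len(commission)}): {sorted(commission)}")
print(f"data completeness: {len(specified & database) / len(specified):.0%}")
```

Note that this only measures data completeness; whether the specification itself is adequate for the application (model completeness) cannot be tested this way.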