Using Metadata to Link Uncertainty and Data Quality
Richard Wadsworth

"Ships that pass in the night and speak each other in passing;
Only a signal shown and a distant voice in the darkness;
So on the ocean of life we pass and speak one another,
Only a look and a voice; then darkness again and a silence."
Tales of a Wayside Inn, Part III. Henry Wadsworth Longfellow (1807–1882)

Data is only a representation of a phenomenon…

Interested in…
• Information, not Data
• Users, not Producers
• What Metadata ought to be doing, not what it is doing
• Exploiting data, not discovering data
• Using land cover as an example

Information v. Reality
"Truth, as in a single, incontrovertible and correct fact, simply does not exist for much geographical information"
• The real world is infinitely complex
• All representations involve…
  – Abstraction, Aggregation, Simplification etc.
• Choices about representation depend on…
  – Commissioning variation (who "paid" for it?)
  – Observer variation (what did you see?)
  – Institutional variation (why do you see it that way?)
  – Representational variation (how did you record it?)

Geographic Objects
• Well defined objects: Buildings, Roads, …
• Poorly defined objects:
  – Vague objects: Mountains, Sand dunes, …
  – Ambiguous objects:
    – Discordant objects: Forests, Bogs, …
    – Non-specific objects: Rural England, improved grassland, …

An Example… land cover
• Two maps (LCMGB, LCM2000), produced 10 years apart, by the same people, using the same basic approach (automatic classification of Landsat ETM+ data)
• In the OS SK tile (100 x 100 km around Leicester):
  – in 1990, <1 ha of "Bog" (12 pixels)
  – in 2000, >7,500 ha of "Bog" (120,728 pixels)
• Was this global change? (Probably not…)

Commissioning context – LCMGB
[Diagram: the network of accountable actors and their links (money, skills, information, control). DoE, BNSC, the Remote Sensing community and Users surround ITE / CEH EOS (Monks Wood); the inputs are satellite images and the 1990 field survey; the output is LCM1990.]

Commissioning context – LCM2000
[Diagram: the network of accountable actors and their links (money, skills, information, control). DETR / DEFRA, the LCM2000 Steering Group, policy issues, agencies and users (1990 & 2000), Laserscan, and the LCM1990 methodology surround ITE EOS / Clevermapping; the inputs include the 2000 field survey; the output is LCM2000.]

Classes "cluster" in attribute space, and both the classes and the relationships between them changed between the two maps.

Specific outcome for "Bog"
• In 1990 "Bog" was a land cover defined by what could be seen:
  "…permanent waterlogging, … depositions of acidic peat … permanent or temporary standing water … water-logging, perhaps with surface water, …"
• In 2000 "Bog" was a 'priority habitat' and identification needed ancillary data:
  "… in areas with peat >0.5 m deep …" (no reference to water)

Another example – what is a forest?
http://home.comcast.net/~gyde/DEFpaper.htm
[Figure: the definitions of "forest" used by different countries and agencies (FAO FRA 2000, UNESCO, SADC, and some 30 countries from Zimbabwe to Estonia), plotted as minimum tree height (m) against minimum canopy cover (%); the thresholds vary enormously.]

The spatial characteristics of the maps also changed.

So let's standardise everything?
• Standards organisations want their standard to be adopted
• Producers want to show they can follow a recipe
• Users want reassurance
• Mediators want to show the data to be of merchantable quality

Standards – an analogy
Your car:
– standards are created (you must have an MOT)
– producers ensure their cars conform
– mediators (sellers) advertise compliance with the standard
– the user (buyer) is reassured that the car is OK
BUT people:
• buy an AA assessment of a particular car
• use a "Which?" report (or Jeremy Clarkson?) to understand whether that type of car is Useful, not just Useable
For data there is no "Which?" report.

Data Quality Standards
Once dominated by the national mapping agencies and software companies, now dominated by ISO, the Open GIS Consortium etc.
The 'big 5' of geo-spatial data quality standards:
• Positional Accuracy
• Attribute Accuracy
• Logical Consistency
• Completeness
• Lineage
Salgé (1995) tried to introduce the concept of semantic accuracy, but it has largely been ignored.

Data Quality v. geographic objects
(Columns follow the nature of geographic reality: well defined objects; poorly defined, i.e. vague, objects; ambiguous objects, i.e. discordant and non-specific objects.)

Measure of data quality   Well defined   Vague       Discordant   Non-specific
Positional accuracy       Yes            Yes, but…   No           No
Attribute accuracy        Yes            Yes, but…   No           No
Logical consistency       Yes            Yes, but…   No           No
Completeness              Yes            Yes, but…   No           No
Lineage                   Yes            Yes         Yes          Yes

Uncertainty v. geographic objects

Technique                 Well defined   Vague       Discordant   Non-specific
Probability*              Yes            No          No           No
Fuzzy sets                (Yes)          Yes         Yes          ?
Dempster-Shafer           (Yes)          Yes         Yes          ?
Endorsement Theory        (Yes)          (Yes)       Yes          Yes
* Including: Monte Carlo, bootstrapping, conditional simulations, frequency, confusion matrices etc.

What can be done?
IF you stretch "Metadata" to include:
• the scientific and policy background (context)
• the organisational and institutional origins of the conceptualisation (ontology)
• how the objects were measured (epistemology)
• how the classes were specified (semantics)
Then…

Semantic-Statistical Comparisons
[Diagram: one expert's opinion of the semantic relationships between the classes of two land cover maps (Bog, Fen, Montane, Acid / Neutral / Calcareous grasslands, Bracken, Dense and Open heath, Suburban, Urban, Inland Bare, saltmarsh and littoral classes, Sea, Water, Arable, Setaside, Broadleaved and Coniferous woodland, …), with links graded from "blue" (expected) to "red" (uncertain).]

Assume the landscape consists of segments, and score each segment of one map against the overlapping segment of the other using the expert's semantic relationships.
For class A in the first classification:
• expected score = 18
• uncertain score = 7 (4 class B pixels + 3 class C pixels)
• unexpected score = 1 (the single pixel of class D)
For class A in the overlapping segment of the second classification:
• expected score = 19 (class X)
• uncertain score = 2 (class Z)
• unexpected score = 5 (class Y)

Combine Scores
The scores are treated as if they were probabilities and combined using Dempster-Shafer:

Belief = (Bel1·Bel2 + Unc1·Bel2 + Unc2·Bel1) / β
where β = 1 – Bel1·Dis2 – Bel2·Dis1

Bel1 & Bel2 = the beliefs (expected scores),
Unc1 & Unc2 = the uncertainties (uncertain scores),
Dis1 & Dis2 = the disbeliefs (unexpected scores).
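A minimal Python sketch of these two steps (this is not the authors' code: the function names, the example class labels and the expert lookup table are illustrative assumptions). It converts a segment's pixel histogram into belief / uncertainty / disbelief fractions and then combines the two classifications with the Dempster-Shafer rule above:

```python
from typing import Dict

def score_segment(pixel_counts: Dict[str, int],
                  relation: Dict[str, str]) -> Dict[str, float]:
    """Turn a segment's pixel histogram into belief / uncertainty /
    disbelief fractions, using an expert lookup that labels each class
    as 'expected', 'uncertain' or 'unexpected' for the target class."""
    total = sum(pixel_counts.values())
    scores = {"expected": 0.0, "uncertain": 0.0, "unexpected": 0.0}
    for cls, count in pixel_counts.items():
        scores[relation[cls]] += count
    return {k: v / total for k, v in scores.items()}

def combine(s1: Dict[str, float], s2: Dict[str, float]) -> float:
    """Dempster-Shafer combination of two (belief, uncertainty,
    disbelief) triples, following the slide's formula."""
    bel1, unc1, dis1 = s1["expected"], s1["uncertain"], s1["unexpected"]
    bel2, unc2, dis2 = s2["expected"], s2["uncertain"], s2["unexpected"]
    beta = 1 - bel1 * dis2 - bel2 * dis1        # normalising factor
    return (bel1 * bel2 + unc1 * bel2 + unc2 * bel1) / beta

# Worked example from the slides: the first classification gives class A
# 18 expected, 7 uncertain (4 of B + 3 of C) and 1 unexpected pixel; the
# second gives 19 expected (X), 2 uncertain (Z) and 5 unexpected (Y).
s1 = score_segment({"A": 18, "B": 4, "C": 3, "D": 1},
                   {"A": "expected", "B": "uncertain",
                    "C": "uncertain", "D": "unexpected"})
s2 = score_segment({"X": 19, "Z": 2, "Y": 5},
                   {"X": "expected", "Z": "uncertain", "Y": "unexpected"})
print(round(combine(s1, s2), 3))                # -> 0.901
```

Applied to the worked example that follows, this reproduces the combined belief of 0.901.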
For class A:
Bel1 = 18/26 = 0.692, Unc1 = 7/26 = 0.269, Dis1 = 1/26 = 0.038
Bel2 = 19/26 = 0.731, Unc2 = 2/26 = 0.077, Dis2 = 5/26 = 0.192
Therefore:
β = 1 – 0.692·0.192 – 0.731·0.038 = 0.839
Belief = (0.692·0.731 + 0.692·0.077 + 0.731·0.269) / 0.839 = 0.901
The belief has increased, so we consider the segment to be consistent for class A.

Conclusions
• Understanding what data means is increasingly important, because of:
  – the increased number of users
  – Spatial Data Initiatives
  – the decreased role of "old fashioned" but complete metadata (the survey memoirs)
  – a naive belief in technology as a solution (standards, inter-operability etc.)
• Metadata needs to include:
  – user experience
  – the producer's understanding of the data
  – the origins of the information
  – an expanded notion of Logical Consistency