Using Metadata to Link Uncertainty and Data Quality

Richard Wadsworth
Ships that pass in the night and speak each other in passing;
Only a signal shown and a distant voice in the darkness;
So on the ocean of life we pass and speak one another,
Only a look and a voice; then darkness again and a silence.
Tales of a Wayside Inn. Part iii.
Henry Wadsworth Longfellow. (1807–1882)
Data is only a representation of a phenomenon …
Interested in …
• Information not Data
• Users not Producers
• What Metadata ought to be doing, not what it is doing
• Exploiting data not discovering data
• Using land cover as an example
Information v. Reality
“Truth, as in a single, incontrovertible and correct fact, simply
does not exist for much geographical information”
• Real world infinitely complex
• All representations involve …
– Abstraction, Aggregation, Simplification etc.
• Choices about representation depend on
– Commissioning variation (who “paid” for it?)
– Observer variation (what did you see?)
– Institutional variation (why do you see it that way?)
– Representational variation (how did you record it?)
Geographic Objects
• Well defined objects: Buildings, Roads, …
• Poorly defined objects:
  – Vague objects: Mountains, Sand dunes, …
  – Ambiguous objects:
    – Discordant objects: Forests, Bogs, …
    – Non-specific objects: Rural England, improved grassland, …
An Example …land cover
• Two maps (LCMGB, LCM2000), produced 10 years
apart, by the same people, using the same basic
approach (automatic classification of Landsat ETM+
data)
• In the OS SK tile (100 x 100km around Leicester)
• In 1990: <1 ha of “Bog” (12 pixels)
• In 2000: >7,500 ha of “Bog” (120,728 pixels)
• Was this global change? (probably not …)
Commissioning context - LCMGB
[Diagram – network of accountable actors and their links (money, skills, information, control): DoE, BNSC, the remote sensing community and ITE / CEH EOS (Monkswood); satellite images and the 1990 field survey as inputs; LCM1990 as the output delivered to users.]
Commissioning context LCM2000
[Diagram – network of accountable actors and their links (money, skills, information, control): DETR / DEFRA, the LCM2000 Steering Group, agencies and users (1990 & 2000), ITE EOS / Clevermapping and Laserscan; LCM1990, policy issues, the LCM2000 methodology and the 2000 field survey as inputs; LCM2000 as the output delivered to users.]
Classes “cluster” in attribute space
Classes and relationships change
Specific outcome for “Bog”
In 1990 “Bog” was a land cover defined by what could be
seen:
“...permanent waterlogging, … depositions of acidic peat …
…permanent or temporary standing water …
...water-logging, perhaps with surface water, …”
In 2000 “Bog” was a ‘priority habitat’ and identification
needed ancillary data:
“… in areas with peat >0.5 m deep …”
(no reference to water).
Another example - What is a forest?
http://home.comcast.net/~gyde/DEFpaper.htm
[Scatter plot – FAO Forest Resource Assessments (FRA 2000): national definitions of “forest” plotted by tree height (m, 0–16) against canopy cover (%, 0–90); the thresholds chosen by countries such as Zimbabwe, Sudan, Turkey, New Zealand, the United States, Japan, Kenya and many others are scattered across the whole range.]
Spatial characteristics also changed
So let’s standardise everything?
• Standards organisations: want their standard to be adopted
• Producers: want to show they can follow a recipe
• Users: want reassurance
• Mediators: want to show the data to be of merchantable quality
Standards – an analogy
Your car
– Standards are created (you must have an MOT)
– Producers ensure their cars conform
– Mediators (sellers) advertise compliance with the standard
– Users (buyers) are reassured that the car is OK
BUT people also
• Buy an AA assessment of a particular car
• Use a “Which?” report (or Jeremy Clarkson?) to understand whether the type of car is Useful, not just Useable
For data there is no “Which?” report.
Data Quality Standards
Once dominated by the national mapping agencies and software
companies, now dominated by ISO, the Open GIS Consortium
etc.
The ‘big 5’ of geo-spatial data quality standards:
• Positional Accuracy,
• Attribute Accuracy,
• Logical Consistency,
• Completeness,
• Lineage.
Salgé (1995) tried to introduce the concept of semantic accuracy, but it has largely been ignored.
Data Quality v. geographic objects
Nature of geographic reality: well defined objects vs. poorly defined objects (vague; ambiguous – discordant; ambiguous – non-specific)

Measure of data quality    Well defined   Vague       Discordant   Non-specific
Positional accuracy        Yes            Yes, but…   No           No
Attribute accuracy         Yes            Yes, but…   No           No
Logical consistency        Yes            Yes, but…   No           No
Completeness               Yes            Yes, but…   No           No
Lineage                    Yes            Yes         Yes          Yes
Uncertainty v. geographic objects
Nature of geographic reality: well defined objects vs. poorly defined objects (vague; ambiguous – discordant; ambiguous – non-specific)

Technique to process uncertainty   Well defined   Vague    Discordant   Non-specific
Probability*                       Yes            No       No           No
Fuzzy sets                         (Yes)          Yes      Yes          ?
Dempster-Shafer                    (Yes)          Yes      Yes          ?
Endorsement theory                 (Yes)          (Yes)    Yes          Yes

*Including: Monte Carlo, bootstrapping, conditional simulations, frequency, confusion matrices etc.
What can be done?
IF you stretch “Metadata” to include:
• the scientific and policy background (context)
• the organisational and institutional origins of the conceptualisation (ontology)
• how objects were measured (epistemology)
• how classes were specified (semantics)
(a hypothetical sketch of such a record is given below)
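Purely as an illustration, a “stretched” metadata record of this kind could carry the four extra elements as explicit fields. This is a minimal sketch, not any existing metadata standard; the field names are hypothetical, and the values are drawn from the LCM2000 example above.

# Hypothetical sketch of a "stretched" metadata record; the field names are
# illustrative only and do not come from any existing standard.
stretched_metadata = {
    "context":      "National land cover census commissioned by DETR / DEFRA",      # scientific and policy background
    "ontology":     "Classes agreed by the LCM2000 Steering Group",                  # institutional origin of the conceptualisation
    "epistemology": "Automatic classification of Landsat imagery plus ancillary data",  # how objects were measured
    "semantics":    "'Bog' = priority habitat in areas with peat > 0.5 m deep",      # how classes were specified
}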
Then …
Semantic-Statistical Comparisons
[Diagram – one expert’s opinion of the semantic relationships between classes in the two land cover maps: LCMGB classes (lowland bog, upland bog, rough grass, moor grass, bracken, meadow, grass heath, mown, saltmarsh, felled, dense/open moor, dense/open heath, inland and coastal bare, littoral and sub-littoral rock and sediment, ruderal, tilled, scrub, sea, urban, suburban, conifer, deciduous, water) linked to LCM2000 classes (Broadleaved 1.1, Coniferous 2.1, Arable 4.1–4.3, Grasslands 5.1, Setaside 5.2, Neutral 6.1, Calcareous 7.1, Acid 8.1, Bracken 9.1, Dense heath 10.1, Open heath 10.2, Fen 11.1, Bog 12.1, Water 13.1, Montane 15.1, Inland bare 16.1, Suburban 17.1, Urban 17.2, Sea 22.1). Links run from “blue” to “red” and mark “expected” and “uncertain” relationships.]
Assume the landscape consists of segments
For class A:
expected score = 18,
uncertain score = 7 (4 class B pixels + 3 class C pixels)
unexpected score = 1 (the single pixel of class D).
For class A in a second classification of the same segment:
expected score = 19 (class X),
uncertain score = 2 (class Y),
unexpected score = 5 (class Z).
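A minimal sketch of this scoring step in Python. The class names (A–D) and the expected/uncertain/unexpected lookup are the hypothetical ones from the example above; in practice the lookup would encode the expert’s semantic comparison of the two maps’ legends.

# Sketch of scoring one segment against class A; the relationship lookup
# is hypothetical and stands in for the expert's semantic comparison.
relationship_to_A = {"A": "expected", "B": "uncertain", "C": "uncertain", "D": "unexpected"}

def score_segment(pixel_classes, lookup):
    """Count the pixels of a segment as expected / uncertain / unexpected."""
    scores = {"expected": 0, "uncertain": 0, "unexpected": 0}
    for c in pixel_classes:
        scores[lookup[c]] += 1
    return scores

segment = ["A"] * 18 + ["B"] * 4 + ["C"] * 3 + ["D"]   # 26 pixels
print(score_segment(segment, relationship_to_A))
# -> {'expected': 18, 'uncertain': 7, 'unexpected': 1}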
Combine Scores
The scores are treated as if they were probabilities and combined using Dempster-Shafer:
Belief = (Bel1·Bel2 + Unc1·Bel2 + Unc2·Bel1) / β
where β = 1 – Bel1·Dis2 – Bel2·Dis1
Bel1 & Bel2 = the beliefs (expected scores),
Unc1 & Unc2 = the uncertainties (uncertain scores),
Dis1 & Dis2 = the disbeliefs (unexpected scores).
For class A.
Bel1 = 18/26 = 0.692, Unc1 = 7/26 = 0.269, Dis1 = 1/26 = 0.038
Bel2 = 19/26 = 0.731, Unc2 = 2/26 = 0.077, Dis2 = 5/26 = 0.192
Therefore:
β = 1 – 0.692*0.192 – 0.731*0.038 = 0.839
Belief = (0.692*0.731 + 0.692*0.077 + 0.731*0.269) / 0.839 = 0.901
The belief has increased, so we consider the segment to be consistent for class A.
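A short sketch of the same combination in Python, reproducing the numbers above (each score is normalised by the 26 pixels of the segment). The function name is ours; the formula is the one given on the slide.

# Dempster-Shafer combination of the two segments' scores for class A
# (Bel = expected, Unc = uncertain, Dis = unexpected).
def combine_beliefs(bel1, unc1, dis1, bel2, unc2, dis2):
    beta = 1 - bel1 * dis2 - bel2 * dis1            # normalising factor
    return (bel1 * bel2 + unc1 * bel2 + unc2 * bel1) / beta

belief = combine_beliefs(18/26, 7/26, 1/26,         # first classification
                         19/26, 2/26, 5/26)         # second classification
print(round(belief, 3))                             # -> 0.901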
Conclusions
• Understanding data meaning is increasingly important:
– Increased number of users
– Spatial Data Initiatives
– Decreased role of “old fashioned” but complete metadata (the survey
memoirs)
– Naive belief in technology as a solution (standards, inter-operability, etc.).
• Metadata needs to include:
– user experience
– producers’ understanding of the data
– origins of the information
– expanded Logical Consistency