A statistical perspective on quality in
GEOSS: challenges and opportunities
Dan Cornford
d.cornford@aston.ac.uk
Aston University
Birmingham, UK
What is quality?
• Quality has several facets, but the key one in my view
is a quantitative statement of the relation of a value
to reality
• Quality is challenging to define
– ISO 9000:
• “Degree to which a set of inherent characteristics fulfils
requirements”
– International Association for Information and Data Quality:
• “Information Quality is not just ‘fitness for purpose’; it must be fit
for all purposes”
– Oxford English Dictionary:
• “the standard of something as measured against other things of a
similar kind; the degree of excellence of something”
The key aspects of data quality
• An exhaustive list might include:
– Accuracy, Integrity, Precision, Objectivity, Completeness, Conciseness,
Redundancy, Validity, Consistency, Timeliness, Accessibility, Utility,
Usability, Flexibility, Traceability
• There is no universal agreement, but we propose:
– accuracy: the value correctly represents the real world
– completeness: degree of data coverage for a given region and time
– consistency: are the rules to which the data should conform met?
– usability: how easy is it to access and use the data?
– traceability: can one see how the results have arisen?
– utility: what is the user’s view of the data’s value to their use case?
Accuracy and what is reality for GEOSS
• Accuracy is the most important quality aspect
• This is not a talk about philosophy ...
– However, we must define objects in the real world using
mental concepts
• I view reality as a set of continuous space-time fields
of discrete or continuous valued variables
– The variables represent different properties of the system,
e.g. temperature, land cover
• A big challenge is that reality varies over almost all
space and time scales, so we need to be precise
about these when defining reality
Relating observations to reality - accuracy
• Assume we can define reality precisely
• I argue the most useful information I can have about an
observation is the relation of this observation to reality
• Express this relation mathematically as y = h(x) where
– y is my observation
– x is reality (not known)
– h() is my sensor / forward / observation model that maps reality
to what I can observe
• I can almost never write this exactly, due to various sources of
uncertainty in my observation and in h, and perhaps variations
in x, so I must write y = h(x) + ε(x)
– ε(x) is the (irreducible) observation uncertainty
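A minimal sketch of this observation model, assuming a scalar reality x, an illustrative nonlinear sensor response h and additive Gaussian noise whose spread depends on x (none of this corresponds to a particular GEOSS instrument):

    import numpy as np

    rng = np.random.default_rng(0)

    def h(x):
        # Illustrative forward / observation model: a smooth nonlinear sensor response.
        return 2.0 + 0.8 * x + 0.05 * x ** 2

    def epsilon(x):
        # Illustrative, state-dependent observation uncertainty: noise grows with |x|.
        sigma = 0.3 + 0.02 * abs(x)
        return rng.normal(0.0, sigma)

    x_true = 15.0                      # reality (unknown in practice)
    y = h(x_true) + epsilon(x_true)    # what the instrument actually reports
    print(y)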
Dealing with the unknown
• Uncertainty is a fundamental part of science
• Managing uncertainty is at the heart of data accuracy
• There are several frameworks for handling uncertainty:
– frequentist probability (requires repeatability)
– subjective Bayesian probability (personal belief)
– fuzzy methods (more relevant to semantics)
– imprecise probabilities (you don’t have full distributions)
– belief theory and other multi-valued representations
• Choosing one is a challenge
– I believe that subjective Bayesian approaches are a good
starting point
What I would really, really want
• You supply an observation y
• I would want to know
– h(x), the observation function
– ε(x), the uncertainty about y, defining p(y|x)
because then I can work out:
p(x|y) ∝ p(y|x) p(x)
• This is the essence of Bayesian (probabilistic) logic –
updating my beliefs
• I need to know about how you define x too – what
spatial and temporal scales
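A minimal sketch of this update on a discretised grid, reusing the illustrative h from above and assuming a Gaussian prior p(x) and Gaussian noise so that p(y|x) is Gaussian; all numerical values are made up for illustration:

    import numpy as np
    from scipy.stats import norm

    def h(x):
        # Same illustrative forward model as before.
        return 2.0 + 0.8 * x + 0.05 * x ** 2

    x_grid = np.linspace(0.0, 30.0, 2000)              # discretised values of reality x
    dx = x_grid[1] - x_grid[0]
    prior = norm.pdf(x_grid, loc=12.0, scale=5.0)      # p(x): prior belief about reality
    y_obs = 25.3                                       # the supplied observation y
    sigma = 0.3 + 0.02 * np.abs(x_grid)                # stated observation noise level
    likelihood = norm.pdf(y_obs, loc=h(x_grid), scale=sigma)   # p(y|x) at the observed y

    posterior = likelihood * prior                     # p(x|y) ∝ p(y|x) p(x)
    posterior /= posterior.sum() * dx                  # normalise numerically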
Why does p(y|x) matter?
• Imagine I have other observations of x (reality)
– How do I combine these observations rationally and
optimally?
• The solution is to use Bayesian updating – we need to
know the joint structure of all the errors!
• This is optimistic (perhaps unknowable); however,
conditional on reality it is likely that for many
observations the errors will be uncorrelated
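As a concrete illustration of the uncorrelated-error case, a minimal sketch assuming two direct observations of the same scalar x with known Gaussian error standard deviations and a flat prior; the combined estimate is then the familiar precision-weighted mean (the numbers are invented):

    import numpy as np

    # Two observations of the same reality x, with stated error standard deviations.
    y = np.array([21.4, 20.1])
    sigma = np.array([0.5, 1.0])

    # If the errors are uncorrelated, p(y1, y2 | x) = p(y1 | x) p(y2 | x); with a flat
    # prior the posterior for x is Gaussian with a precision-weighted mean.
    w = 1.0 / sigma ** 2
    x_hat = np.sum(w * y) / np.sum(w)
    x_std = np.sqrt(1.0 / np.sum(w))
    print(x_hat, x_std)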
What about a practical example?
• Consider weather forecasting:
– essentially an initial value problem – we need to know
about reality x at an initial time
– thus if we can define p(x|y) at the start of our forecast we
are good to go
– this is what is called data assimilation; combining different
observations requires good uncertainty characterisation
for each observation, p(y|x)
• In data assimilation we also need to know about the
observation model y = h(x) + ε(x) and its uncertainty
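A minimal sketch of a single scalar analysis step of the kind data assimilation performs, assuming a Gaussian background (prior) state, a linear observation operator H and Gaussian observation error; this is the textbook Kalman-style update, not any particular operational scheme:

    # Background (prior) state estimate and its error variance, e.g. a forecast temperature.
    x_b, var_b = 280.0, 4.0

    # Observation, linear observation operator H and observation error variance (from p(y|x)).
    y, H, var_o = 282.5, 1.0, 1.0

    # Kalman-style analysis: posterior mean and variance of x given y.
    K = var_b * H / (H * var_b * H + var_o)    # gain: how much to trust the observation
    x_a = x_b + K * (y - H * x_b)              # analysis (posterior mean)
    var_a = (1.0 - K * H) * var_b              # analysis (posterior) variance
    print(x_a, var_a)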
How do we get p(y|x)?
• This is where QA4EO comes in – it has a strong
metrology emphasis, in which all sources of uncertainty
are identified
– this is very challenging – to define a complete probability
distribution requires many assumptions
• So we had better try to check our model p(y|x)
against reference validation data, and assess the
reliability of the density
– reliability is used in a technical sense – it is a measure of
whether the probabilities estimated are really observed
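A minimal sketch of one standard reliability check, assuming Gaussian predictive distributions p(y|x) at a set of validation sites and trusted reference values: the probability integral transform (PIT) values should be close to uniform if the stated probabilities are reliable (the data here are simulated, deliberately over-confident, purely for illustration):

    import numpy as np
    from scipy.stats import norm, kstest

    rng = np.random.default_rng(1)

    # Stated predictive distributions at n validation sites (means and std devs).
    n = 500
    mu = rng.normal(20.0, 3.0, size=n)
    sigma_stated = np.full(n, 1.0)

    # Simulated reference observations with a larger true error, mimicking an
    # over-confident uncertainty model.
    y_ref = mu + rng.normal(0.0, 1.5, size=n)

    # PIT values: approximately uniform on [0, 1] if the stated densities are reliable.
    pit = norm.cdf(y_ref, loc=mu, scale=sigma_stated)
    print(kstest(pit, "uniform"))   # a tiny p-value flags unreliable probabilities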
Practicalities of obtaining p(y|x)
• This is not easy – I imagine a two-pronged approach
– Lab based “forward” assessment of instrument
characteristics to build an initial uncertainty model
– Field based validation campaigns to update our beliefs
about the uncertainty, using a data assimilation like
approach
• Continual refinement based on ongoing validation
• This requires new statistical methods, new systems
and new software and is a challenge!
– ideally should integrate into data assimilation ...
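A minimal sketch of the "update our beliefs about the uncertainty" step, assuming a zero-mean Gaussian observation error with unknown variance, an inverse-gamma prior on that variance from the lab assessment, and field residuals (observation minus reference) from a validation campaign; this is a standard conjugate update, shown only to illustrate the idea:

    import numpy as np

    # Lab-based prior on the observation error variance: inverse-gamma(a0, b0),
    # chosen so the prior mean b0 / (a0 - 1) equals 0.3 ** 2.
    a0 = 5.0
    b0 = (a0 - 1.0) * 0.3 ** 2

    # Field validation residuals: observation minus trusted reference value.
    residuals = np.array([0.25, -0.41, 0.10, 0.55, -0.30, 0.05, 0.38, -0.22])

    # Conjugate posterior for the variance given zero-mean Gaussian residuals.
    n = residuals.size
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * np.sum(residuals ** 2)

    post_mean_var = b_n / (a_n - 1.0)      # posterior mean of the error variance
    print(np.sqrt(post_mean_var))          # updated belief about the error std dev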
How does this relate to current approaches?
• Current approaches in e.g. ISO 19115 / 19115-2 (metadata)
recognise quality as something important but:
– they do not give strong enough guidance on using useful
quality indicators (QA4EO addresses this to a greater
degree)
– many of the quality indicators are very esoteric and not
statistically well motivated (still true in ISO 19157)
• We have built UncertML specifically to describe
uncertainty information flexibly but precisely, in a
usable form that allows probabilistic solutions
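To make the idea concrete, a schematic of the kind of machine-readable record such an encoding aims at – an observation carried together with a parametric description of p(y|x); the field names below are invented for illustration and are not the actual UncertML vocabulary:

    # Hypothetical structure, not the real UncertML encoding.
    observation_record = {
        "observed_property": "surface_temperature",
        "value": 282.5,                      # the observation y
        "unit": "K",
        "uncertainty": {                     # parametric description of p(y|x)
            "distribution": "Gaussian",
            "mean_offset": 0.0,              # assumed bias relative to reality
            "standard_deviation": 1.0,
        },
    }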
Why probabilistic quality modelling?
• We need a precise, context independent definition of
the observation accuracy as a key part of quality
• Probabilistic approaches
– work for all uses of the observation – not context specific
– provide a coherent, principled framework for using
observations
– allow integration of data (information interoperability) and
data reuse
– can extract information from even noisy data – given
reliable probabilities, data of any ‘quality’ can be used
Quality, metadata and the GEO label
• Quantitative probabilistic accuracy information is the
single most useful aspect
• Other aspects remain relevant:
– traceability: provenance / lineage
– usability: ease / cost of access
– completeness: coverage
– validity: conformance to internal and external rules
– utility: user rating
• I think the GEO label concept must put a well-defined
probabilistic notion of accuracy at its heart, but also
consider these other quality aspects
Summary
• Quality has many facets – accuracy is key
– Accuracy should be well defined requiring a rigorous
statistical framework, and a definition of reality
• Quality should be at the heart of a GEO label
– QA4EO is starting to show the way
• We also need to show how to implement this
• GeoViQua will develop some of the necessary tools,
but this is a long road ...