A statistical perspective on quality in GEOSS: challenges and opportunities
Dan Cornford
d.cornford@aston.ac.uk
Aston University, Birmingham, UK

What is quality?
• Quality has several facets, but the key one in my view is a quantitative statement of the relation of a value to reality
• Quality is challenging to define
  – ISO 9000:
    • “Degree to which a set of inherent characteristics fulfils requirements”
  – International Association for Information and Data Quality:
    • “Information Quality is not just ‘fitness for purpose’; it must be fit for all purposes”
  – Oxford English Dictionary:
    • “the standard of something as measured against other things of a similar kind; the degree of excellence of something”

The key aspects of data quality
• An exhaustive list might include:
  – Accuracy, Integrity, Precision, Objectivity, Completeness, Conciseness, Redundancy, Validity, Consistency, Timeliness, Accessibility, Utility, Usability, Flexibility, Traceability
• There is no universal agreement, but we propose:
  – accuracy: the value correctly represents the real world
  – completeness: degree of data coverage for a given region and time
  – consistency: are the rules to which the data should conform met
  – usability: how easy is it to access and use the data
  – traceability: can one see how the results have arisen
  – utility: what is the user’s view of the data’s value to their use case

Accuracy and what is reality for GEOSS
• Accuracy is the most important quality aspect
• This is not a talk about philosophy ...
  – However, we must define objects in the real world using mental concepts
• I view reality as a set of continuous space-time fields of discrete or continuous valued variables
  – The variables represent different properties of the system, e.g. temperature, land cover
• A big challenge is that reality varies over almost all space and time scales, so we need to be precise about these scales when defining reality

Relating observations to reality – accuracy
• Assume we can define reality precisely
• I argue the most useful information I can have about an observation is the relation of this observation to reality
• Express this relation mathematically as y = h(x), where
  – y is my observation
  – x is reality (not known)
  – h() is my sensor / forward / observation model that maps reality to what I can observe
• I can almost never write this exactly, due to various sources of uncertainty in my observation and in h, and perhaps variations in x, so I must write y = h(x) + ε(x)
  – ε(x) is the (irreducible) observation uncertainty

Dealing with the unknown
• Uncertainty is a fundamental part of science
• Managing uncertainty is at the heart of data accuracy
• There are several frameworks for handling uncertainty:
  – frequentist probability (requires repeatability)
  – subjective Bayesian probability (personal belief)
  – fuzzy methods (more relevant to semantics)
  – imprecise probabilities (you don’t have full distributions)
  – belief theory and other multi-valued representations
• Choosing one is a challenge
  – I believe that subjective Bayesian approaches are a good starting point

What I would really, really want
• You supply an observation y
• I would want to know
  – h(x), the observation function
  – ε(x), the uncertainty about y, defining p(y|x)
• because then I can work out: p(x|y) ∝ p(y|x) p(x)
• This is the essence of Bayesian (probabilistic) logic – updating my beliefs (see the sketch after this slide)
• I need to know about how you define x too – what spatial and temporal scales
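To make the update p(x|y) ∝ p(y|x) p(x) concrete, here is a minimal illustrative sketch, not part of any GEOSS toolchain. It assumes a scalar reality x with a Gaussian prior, a linear observation operator h, and Gaussian observation noise, in which case the posterior is available in closed form. All names and numbers (prior_mean, obs, etc.) are invented for illustration only.

```python
import numpy as np

# Illustrative values only: a Gaussian prior belief about reality x,
# a linear observation operator h, and Gaussian observation noise.
prior_mean, prior_var = 15.0, 4.0  # p(x) = N(15, 4), e.g. a temperature in degrees C
h = 1.0                            # observation operator: y = h * x + eps
obs, obs_var = 16.2, 1.0           # observation y and its stated uncertainty: p(y|x) = N(h*x, 1)

# Conjugate Gaussian update: p(x|y) is proportional to p(y|x) p(x) and is again Gaussian.
post_var = 1.0 / (1.0 / prior_var + h**2 / obs_var)
post_mean = post_var * (prior_mean / prior_var + h * obs / obs_var)

print(f"posterior p(x|y) = N({post_mean:.2f}, {post_var:.2f})")
# The posterior mean sits between the prior mean and the observation,
# weighted by their respective precisions (inverse variances).
```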
Why does p(y|x) matter?
• Imagine I have other observations of x (reality)
  – How do I combine these observations rationally and optimally?
• The solution is to use Bayesian updates – but we need to know the joint structure of all the errors! (see the combination sketch after these slides)
• This is optimistic (perhaps unknowable?); however, conditional on reality, it is likely that for many observations the errors will be uncorrelated

What about a practical example?
• Consider weather forecasting:
  – essentially an initial value problem
  – we need to know about reality x at an initial time
  – thus if we can define p(x|y) at the start of our forecast we are good to go
  – this is what is called data assimilation; combining different observations requires good uncertainty characterisation for each observation, p(y|x)
• In data assimilation we also need to know about the observation model y = h(x) + ε(x) and its uncertainty

How do we get p(y|x)?
• This is where QA4EO comes in
  – it has a strong metrology emphasis where all sources of uncertainty are identified
  – this is very challenging – defining a complete probability distribution requires many assumptions
• So we had better try to check our model, p(y|x), using reference validation data, and assess the reliability of the density (see the reliability sketch after these slides)
  – reliability is used in a technical sense – it measures whether the probabilities we state are matched by the frequencies actually observed

Practicalities of obtaining p(y|x)
• This is not easy – I imagine a two-pronged approach
  – lab-based “forward” assessment of instrument characteristics to build an initial uncertainty model
  – field-based validation campaigns to update our beliefs about the uncertainty, using a data-assimilation-like approach
• Continual refinement based on ongoing validation
• This requires new statistical methods, new systems and new software, and is a challenge!
  – ideally it should integrate into data assimilation ...

How does this relate to current approaches?
• Current approaches in e.g. ISO 19115 / 19115-2 (metadata) recognise quality as something important, but:
  – they do not give strong enough guidance on using useful quality indicators (QA4EO addresses this to a greater degree)
  – many of the quality indicators are very esoteric and not statistically well motivated (still true in ISO 19157)
• We have built UncertML specifically to describe uncertainty information flexibly but precisely, in a usable form, to allow probabilistic solutions
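As a follow-up to “Why does p(y|x) matter?”, here is a minimal sketch of combining several observations of the same quantity under strong simplifying assumptions: a scalar x, direct observations y_i = x + ε_i, and independent Gaussian errors with stated standard deviations. Under Bayes’ rule each observation contributes in proportion to its precision, so even a noisy observation adds some information. All names and numbers are illustrative and do not come from the talk.

```python
import numpy as np

# Illustrative observations of the same quantity x, each with its own
# stated error standard deviation, i.e. its own p(y_i | x) = N(x, sd_i^2).
obs = np.array([16.2, 15.1, 17.0])   # a precise, a moderate and a noisy sensor
obs_sd = np.array([0.5, 1.0, 3.0])

# Start from a very vague Gaussian prior on x and fold in every observation.
# With independent Gaussian errors the posterior precision is the sum of the
# individual precisions, and the posterior mean is the precision-weighted average.
prior_mean, prior_prec = 0.0, 1e-6   # effectively uninformative prior
post_prec = prior_prec + np.sum(1.0 / obs_sd**2)
post_mean = (prior_prec * prior_mean + np.sum(obs / obs_sd**2)) / post_prec

print(f"combined estimate: N({post_mean:.2f}, {1.0 / post_prec:.3f})")
# The posterior variance is smaller than that of the best single observation:
# even the sd = 3.0 sensor contributes a little information.
```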
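The reliability check mentioned under “How do we get p(y|x)?” can be made concrete with a simple calibration diagnostic: given reference (validation) values and the predictive distributions we claim for them, compute probability integral transform (PIT) values and the empirical coverage of a nominal interval. This sketch simulates its own data purely for illustration; in practice the reference values would come from a validation campaign.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated reference (validation) truth and corresponding predictions,
# each prediction being a Gaussian with a stated mean and standard deviation.
truth = rng.normal(20.0, 2.0, size=500)
pred_mean = truth + rng.normal(0.0, 1.0, size=500)  # predictions with error sd 1.0
pred_sd = np.full(500, 1.0)                         # the uncertainty we *claim*

# Probability integral transform: if the claimed distributions are reliable,
# these values should look uniform on [0, 1].
pit = stats.norm.cdf(truth, loc=pred_mean, scale=pred_sd)

# Empirical coverage of the nominal central 90% interval.
inside = np.mean((pit > 0.05) & (pit < 0.95))
print(f"nominal 90% interval, empirical coverage: {inside:.2f}")
# Coverage well below 0.90 would indicate over-confident (unreliable)
# uncertainty statements; coverage well above it, over-dispersed ones.
```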
Why probabilistic quality modelling?
• We need a precise, context-independent definition of the observation accuracy as a key part of quality
• Probabilistic approaches
  – work for all uses of the observation – not context specific
  – provide a coherent, principled framework for using observations
  – allow integration of data (information interoperability) and data reuse
  – can extract information from even noisy data
  – assuming reliable probabilities, any ‘quality’ of data can be used

Quality, metadata and the GEO label
• Quantitative probabilistic accuracy information is the single most useful aspect
• Other aspects remain relevant:
  – traceability: provenance / lineage
  – usability: ease / cost of access
  – completeness: coverage
  – validity: conformance to internal and external rules
  – utility: user rating
• I think the GEO label concept must put a well-defined probabilistic notion of accuracy at its heart, but also consider these other quality aspects

Summary
• Quality has many facets – accuracy is key
  – Accuracy should be well defined, requiring a rigorous statistical framework and a definition of reality
• Quality should be at the heart of a GEO label
  – QA4EO is starting to show the way
• We also need to show how to implement this
• GeoViQua will develop some of the necessary tools, but this is a long road ...