Metadata, Ontologies, and Provenance: Towards Extended Forms of Data Management
Beth Plale, Yogesh Simmhan
Computer Science Dept.
2005-03-07T18:00-05:00 Networks & Complex Systems Seminar Talk

The Data Deluge
Computational science is increasingly data intense and getting more so. Why?
• More complex computations:
  – Nested model runs
  – Linked models
  – Finer resolution
• More sources of data products
  – Observational data products
    • Streaming continuously from hundreds of sensor and network sources, scaling to thousands
    • Large archives
  – Annotations
  – Model configuration parameters
  – Output results
  – Model data
  – Statistical data (e.g., data mining)

Problem
Computational scientists are reaching their limit on ability to manage data products associated with investigations
  – Scientist can touch hundreds to thousands of data products in single investigation

Seeds of solution in Internet?
• Internet has proven the utility of user-oriented view towards information space management
  – Search, tag: browser, bookmarks
  – Publish: blogs, web page tools
• But web not completely appropriate. Web is
  – Single-writer, multiple reader, and
  – Search-and-download.
• Apply concept of user-oriented view to managing data space
• Want ability to work locally.
  – myLEAD: tool to help an investigator make sense of, and operate in, the vast information space that is computational science (e.g., mesoscale meteorology).

Personal metadata catalog requirements
Scientists have the following needs:
• Want to share products but retain control over what gets shared and with whom
  – Data not made public until results appear in journal
• Want rich search criteria over vast data space but don’t necessarily want to write SQL queries
• Need help managing products generated over extended period of time (i.e., years)
• Want high level of reliability: data must always be accessible

Distributed and replicated personal metadata catalogues
[Figure: sites IU, UA Huntsville, UCAR Unidata, Millersville, Okla Univ, NCSA; labels: Satellite myLEAD catalog, Master myLEAD catalog]
-- distribution: users partitioned over 6 sites in LEAD testbed
-- replication: master is replica site for all 6 satellites

[Figure: User Bob’s workspace in 1998, 2002, and 2003. Labels: Hurricane Ivan; SE quadrant; Voltice study 1998, 2002, 2003; Input parameter; Workflow template; Collection. Below: Physical data storage; Metadata Catalog with Table of User, Table of collection, Table of file; ftp://fileserver.org/file1998o768]

Ontologies aid in querying, sharing
[Figure labels: Does not know existence; Sharing; Non-preserved data product; Flat structure; Depth 3: browsable; Structure; Depth 2: searchable structure; Non-published; Data products of other users; preservation]
Ontologies provide
-- transparent structure
-- controlled vocabulary

LEAD (http://lead.ou.edu)
• Each year, mesoscale weather (floods, tornadoes, hail, strong winds, lightning, and winter storms) causes hundreds of deaths, routinely disrupts transportation and commerce, and results in annual economic losses > $13B.

Conventional Numerical Weather Prediction
• OBSERVATIONS: Radar Data; Mobile Mesonets; Surface Observations; Upper-Air Balloons; Commercial Aircraft; Geostationary and Polar Orbiting Satellite; Wind Profilers; GPS Satellites
• Analysis/Assimilation: Quality Control; Retrieval of Unobserved Quantities; Creation of Gridded Fields
• Prediction: PCs to Teraflop Systems
• Product Generation, Display, Dissemination
• End Users: NWS; Private Companies; Students
The process is entirely serial and pre-scheduled: no response to weather!

The LEAD Vision: No Longer Serial or Static
[Figure: the same stages and components as the conventional pipeline above]

Objective discussed in this talk:
• Grow the value of the data holdings.
Can do so through provenance: workflow Process, time, causality
[Figure labels: myLEAD; time]

Exploiting Provenance Metadata

Contents of Talk
• Importance of Provenance
• Techniques for Provenance Management
• Data Quality and Provenance
• Conclusion

Data Provenance
• Derivation History of Data starting from its original sources
• Data: Files, tables, tuples, virtual collections
• Derivation: Process that transforms data
  – Script, Web service, Queries, Commands
• Lineage, Pedigree, Genealogy, Filiation, Parentage, …

A Simple Provenance DAG
[Figure: DAG over nodes D3, D4, P2, P3, D1, D2, D2’, P1, D0’, D0]

Importance of Provenance
• Scientific Domain
  – Publications are Provenance!
  – Many scientific datasets available online
    • Biology, Astronomy (SDSS)
  – Standard metadata describes datasets in well-known repositories
  – Lineage information usually missing, but vital
  – GIS: Fitness for use
  – Material Engineering: Pedigree, Auditing
  – Biology: Citation & copyright, trust
  – Astronomy: Context information

Importance of Provenance
• Business Domain
  – Data warehousing: Integrated view over historical data from multiple sources
  – Complex transformations to generate normalized view (ETL)
  – Business analytics and intelligence (OLAP queries)
  – Lineage allows “drill-down” from view to source table
  – Allows tracing back sources of errors
  – “View deletion” problem
[Figure: nodes T1, T2, Q2, Q3, V1, V2, P1, V0; labels: Source Tables, Extract, Transform, Load, View Data]

Application of Provenance
• Data Quality
  – Evaluate quality of data
  – Trust in the source of data
  – Use provenance and metadata information to estimate data quality for a user
  – Assertions and Signatures for provenance guarantee
• Audit Trail
  – Error detection
  – Usage log

Application of Provenance
• Replication Recipe
  – Provenance can be recipe for generating a dataset
  – Repeat to verify/compare
  – Recreate/replicate
  – Partial updates
• Attribution
  – Copyright, citation, check data users
• Informational
  – Discover datasets
  – Browse provenance

Subject of Provenance
• What is provenance about?
• Granularity
  – Attribute, tables, files, data collections
  – Fine-grained vs. Coarse-grained
  – Trade-off with cost of collecting, storing, querying
• Data vs. Process Provenance
  – Provenance can be a graph of data & processes
  – Which of them is provenance focused upon?
  – Hybrid where all grouped together

Process vs. Data Oriented
[Figure: two copies of the DAG over nodes D3, D4, P2, P3, D1, D2, D2’, P1, D0’, D0]

Data Processing Architectures
• Service Oriented Architecture
  – Grid & Web services
  – Workflow & Service invocations
  – Data as parameters, references
• Databases
  – Update/View Queries, Stored Procedure Calls
  – Views, Tables, tuples, attributes
• Scripting, Command-line, etc.

Scheme for Representing Provenance
• Scheme for representing provenance
  – Annotations vs. Inversion
• Annotation
  – Annotate data with ancestral data & the steps used to derive it, e.g. a DAG
  – Annotation requires more storage; “Eager”
  – Annotation can be as rich as user decides
• Inversion
  – Store function (query) used to generate data and invert it
  – Not all functions are invertible; auxiliary data required; JIT computation; query optimization
  – Minimal information provided (“Where”, “Why”)

Syntactic vs. Semantic Representation of Provenance
• Syntactic Structure
  – XML for Annotations
  – Implementation-specific for Inversion
• Semantic Knowledge
  – Semantic language used to define lineage metadata
    • RDF, OWL
  – Advantages
    • Provides Context
    • Enhance searches
    • Lineage proofs
  – Ontologies used as a framework for semantic knowledge
  – Community effort needed!

Provenance Storage
• Stored with or separate from data?
  – Integrity, accessibility
• Maintenance
  – Mutability, versioning
  – Who is responsible: data creator or central?
• Scalability
  – # of datasets, depth of lineage, granularity, geographical distribution, # of users
  – Inversion vs. Annotation; Distributed vs. Centralized
• Overhead
  – Collection & storage
  – Automation

Provenance Dissemination
• Browsing Provenance as a DAG
  – Go back and forward in lineage through GUI
• Query based on lineage
  – By source data, or generating process
  – Enhanced by semantic information
  – Drill down during data mining
• Verify how data was created by reenactment or present proof statements

Taxonomy in Brief
• Application of Provenance
  – Data quality
  – Replication Recipe
  – Audit trail
  – Informational
  – Attribution
• Subject of Provenance
  – Data vs. Process
  – Granularity
• Representation of Provenance
  – Annotation vs. Inversion
  – Syntactic vs. Semantic
  – Contents
• Provenance Storage
  – Scalability
  – Overhead
• Provenance Dissemination

Data Quality for Scientific Data
• Fitness for use
• Subjective & Objective Parameters
  – believability, reputation, reliability
  – precision, timeliness, accuracy
• Intrinsic Quality of data vs. Quality of data service
  – Correctness, consistency
  – accessibility, throughput, availability
• Good quality for one application may not be good for another (user driven)

Estimating Data Quality from Provenance
• Hypothesis: For derived datasets, quality depends not just on the dataset but also on its provenance, that is, its ancestral processes and data
• Quality of a dataset could be a function of:
  – Attributes of dataset
  – Attributes of generating process
  – Ancestral Datasets used to derive this dataset
  – And so on recursively …

Weighted DAG?
[Figure: DAG over nodes D3, D4, P2, P3, D1, D2, D2’, P1, D0’, D0, annotated with quality estimates]
D3_q = f(D3)
P2_q = f(P2, D3_q)
D1_q = f(D1, P2_q)
D4_q = f(D4)
P3_q = f(P3, D4_q)
D2_q = f(D2, P3_q)
P1_q = f(P1, D1_q, D2_q, D4_q)
D0_q = f(D0, P1_q)

Challenges for Quality Metrics
• Some processes may produce better quality data than their input datasets
• Subsetting, aggregation of data may change overall quality estimate
• Quality of transformation may be parameter dependent
• Multiple user profiles for different applications
• Missing lineage information can short-circuit measurement

Uses of Data Quality Measurement
• Compare and rank datasets uniformly
  – Google Personalized
• Reduce search space to datasets matching user quality requirement
• Build community-wide quality feedback mechanism
  – Leverage knowledge of domain expert
  – Promote publication of better quality data
  – Amazon reviews?

Research Questions
• What are the metrics for estimating the quality of data using provenance?
• How do we optimize user-centric searches based on quality?
• How can we recover information from incomplete lineage?

Thank you!
Questions | Comments
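
Supplementary sketch: the recursive estimate on the "Weighted DAG?" slide, where each node's quality X_q is a function of the node itself and the quality of its ancestors, can be illustrated in a few lines of Python. This is only a sketch under stated assumptions. The talk leaves f unspecified, so the rule used here (a node's own score multiplied by its weakest ancestor estimate) is a hypothetical choice, as are the `Node` class, the `quality` function, and every numeric score. The edges follow the slide's formulas as written.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A dataset (D*) or process (P*) in a provenance DAG."""
    name: str
    own_score: float  # hypothetical intrinsic score in [0, 1]
    inputs: List["Node"] = field(default_factory=list)  # direct ancestors


def quality(node: Node, cache: Optional[Dict[str, float]] = None) -> float:
    """Recursive estimate X_q = f(X, quality of X's direct ancestors).

    Assumption: f multiplies the node's own score by the weakest
    ancestor estimate. The slides do not define f.
    """
    if cache is None:
        cache = {}
    if node.name not in cache:
        # A source node has no ancestors, so its estimate is its own score.
        weakest = min((quality(n, cache) for n in node.inputs), default=1.0)
        cache[node.name] = node.own_score * weakest
    return cache[node.name]


# Edges follow the slide's formulas; all scores are made up for illustration.
d3 = Node("D3", 0.9)
d4 = Node("D4", 0.8)
p2 = Node("P2", 1.0, [d3])
p3 = Node("P3", 0.5, [d4])
d1 = Node("D1", 1.0, [p2])
d2 = Node("D2", 1.0, [p3])
p1 = Node("P1", 1.0, [d1, d2, d4])
d0 = Node("D0", 1.0, [p1])

print(quality(d0))  # 0.4: the weak P3 step caps everything downstream
```

A weakest-link rule can never rate a derived dataset above its inputs, so it cannot express the first item under "Challenges for Quality Metrics" (a process that improves on its input). A fuller version would likely take f as a parameter per process and per user profile.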