Metadata, Ontologies, and Provenance: Towards
Extended Forms of Data Management
Beth Plale,
Yogesh Simmhan
Computer Science Dept.
The Data Deluge
Computational science is increasingly data intensive and
getting more so. Why?
• More complex computations:
– Nested model runs
– Linked models
– Finer resolution
• More sources of data products
– Observational data products
  • Streaming continuously from hundreds of sensor and network sources, scaling to
    thousands
  • Large archives
– Annotations
– Model configuration parameters
– Output results
– Model data
– Statistical data (e.g., data mining)
2005-03-07T18:00-05:00
Networks & Complex Systems
Seminar Talk
Problem
Computational scientists are reaching the limit of their ability
to manage the data products associated with their investigations
– A scientist can touch hundreds to thousands of data products in
a single investigation
Seeds of solution in Internet?
• Internet has proven the utility of user-oriented view
towards information space management
– Search, tag: browser, bookmarks
– Publish: blogs, web page tools
• But web not completely appropriate. Web is
– Single-writer, multiple reader, and
– Search-and-download.
• Apply concept of user-oriented view to managing data
space
• Want ability to work locally.
– myLEAD: a tool to help an investigator make sense of, and
operate in, the vast information space that is computational
science (e.g., mesoscale meteorology).
Personal metadata catalog requirements
Scientists have the following needs:
• Want to share products but retain control over what
gets shared and with whom
– Data not made public until results appear in journal
• Want rich search criteria over vast data space but don’t
necessarily want to write SQL queries
• Need help managing products generated over an extended
period of time (i.e., years)
• Want a high level of reliability: data must always be
accessible
Distributed and replicated personal metadata
catalogues
[Figure: satellite myLEAD catalogs and a master myLEAD catalog across the LEAD testbed sites: IU, UA Huntsville, UCAR Unidata, Millersville, Okla Univ, NCSA]
-- distribution: users partitioned over 6 sites in LEAD testbed
-- replication: master is replica site for all satellites
[Figure: User Bob’s workspace in 1998, 2002, and 2003. Workspace labels: Hurricane Ivan; SE quadrant; Voltice study 1998, 2002, 2003; Input parameter; Workflow template; Collection. The Metadata Catalog (table of user, table of collection, table of file) points to physical data storage, e.g. ftp://fileserver.org/file1998o768]
Ontologies aid in querying
[Figure labels: sharing; preservation; Does not know existence; Non-preserved data product; Flat structure; Depth 2: searchable structure; Depth 3: browsable structure; Non-published data products of other users]
Ontologies provide
-- transparent structure
-- controlled vocabulary
LEAD
(http://lead.ou.edu)
• Each year, mesoscale weather – floods, tornadoes, hail,
strong winds, lightning, and winter storms – causes
hundreds of deaths, routinely disrupts transportation and
commerce, and results in annual economic losses > $13B.
Conventional Numerical Weather Prediction
OBSERVATIONS
– Radar Data
– Mobile Mesonets
– Surface Observations
– Upper-Air Balloons
– Commercial Aircraft
– Geostationary and Polar Orbiting Satellite
– Wind Profilers
– GPS Satellites
Analysis/Assimilation
– Quality Control
– Retrieval of Unobserved Quantities
– Creation of Gridded Fields
Prediction
– PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users
– NWS
– Private Companies
– Students
The process is entirely serial and pre-scheduled: no response to weather!
The LEAD Vision: No Longer Serial or Static
[Figure: the stages of the conventional pipeline, no longer serial or static: OBSERVATIONS (Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites); Analysis/Assimilation (Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields); Prediction (PCs to Teraflop Systems); Product Generation, Display, Dissemination; End Users (NWS, Private Companies, Students)]
Objective discussed in this talk:
• Grow the value of the data holdings. Can do so
through provenance:
[Figure labels: workflow; process, time, causality; myLEAD; time]
Exploiting Provenance Metadata
Contents of Talk
• Importance of Provenance
• Techniques for Provenance Management
• Data Quality and Provenance
• Conclusion
Data Provenance
• Derivation History of Data starting from its
original sources
• Data: Files, tables, tuples, virtual collections
• Derivation: Process that transforms data –
Script, Web service, Queries, Commands
• Lineage, Pedigree, Genealogy, Filiation,
Parentage, …
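
The definitions above can be sketched as a small graph structure. This is only an illustration, not the myLEAD implementation: the `Node` class, the `lineage` function, and the dataset and process names are all hypothetical. Data products and the processes that transform them are both nodes, and each node records what it was derived from.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in a provenance graph: a data product or a process (hypothetical model)."""
    name: str
    kind: str  # "data" or "process"
    inputs: list["Node"] = field(default_factory=list)  # nodes this one was derived from


def lineage(node: Node) -> list[str]:
    """Return the names of all ancestors of `node`, nearest first (breadth-first)."""
    seen: list[str] = []
    queue = list(node.inputs)
    while queue:
        current = queue.pop(0)
        if current.name in seen:
            continue  # a shared ancestor is reported only once
        seen.append(current.name)
        queue.extend(current.inputs)
    return seen


# Made-up example: radar_scan -> quality_control -> clean_scan -> gridding -> gridded_field
radar_scan = Node("radar_scan", "data")
quality_control = Node("quality_control", "process", [radar_scan])
clean_scan = Node("clean_scan", "data", [quality_control])
gridding = Node("gridding", "process", [clean_scan])
gridded_field = Node("gridded_field", "data", [gridding])

history = lineage(gridded_field)
```

Walking `inputs` backwards from a data product yields its derivation history down to the original sources, which have no inputs of their own.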
A Simple Provenance DAG
[Figure: provenance DAG over data products D0, D0’, D1, D2, D2’, D3, D4 and processes P1, P2, P3]
Importance of Provenance
• Scientific Domain
– Publications are Provenance!
– Many scientific datasets available online
• Biology, Astronomy (SDSS)
– Standard metadata describes datasets in well-known
repositories
– Lineage information usually missing, but vital
– GIS: Fitness for use
– Material Engineering: Pedigree, Auditing
– Biology: Citation & copyright, trust
– Astronomy: Context information
Importance of Provenance
• Business Domain
– Data warehousing: Integrated
view over historical data from
multiple sources
– Complex transformations to
generate normalized view (ETL)
– Business analytics and
intelligence (OLAP queries)
– Lineage allows “drill-down”
from view to source table
– Allows tracing back sources of
errors
– “View deletion” problem
[Figure labels: Source Tables T1, T2; Extract, Transform, Load; Q2, Q3, P1; View Data V0, V1, V2]
Application of Provenance
• Data Quality
– Evaluate quality of data
– Trust in the source of data
– Use provenance and metadata information to
estimate data quality for a user
– Assertions and Signatures for provenance guarantee
• Audit Trail
– Error detection
– Usage log
Application of Provenance
• Replication Recipe
– Provenance can be recipe for generating a dataset
– Repeat to verify/compare
– Recreate/replicate
– Partial updates
• Attribution
– Copyright, citation, check data users
• Informational
– Discover datasets
– Browse provenance
Subject of Provenance
• What is provenance about?
• Granularity
– Attributes, tables, files, data collections: fine-grained
vs. coarse-grained
– Trade-off with cost of collecting, storing, querying
• Data vs. Process Provenance
– Provenance can be a graph of data & processes
– Which of them is provenance focused upon?
– Hybrid where all grouped together
Process vs. Data Oriented
[Figure: the provenance DAG (data products D0, D0’, D1, D2, D2’, D3, D4; processes P1, P2, P3) shown twice, once process-oriented and once data-oriented]
Data Processing Architectures
• Service Oriented Architecture
– Grid & Web services
– Workflow & Service invocations
– Data as parameters, references
• Databases
– Update/View Queries, Stored Procedure Calls
– Views, Tables, tuples, attributes
• Scripting, Command-line, etc.
Scheme for Representing Provenance
• Scheme for representing provenance
– Annotations vs. Inversion
• Annotation
– Annotate data with ancestral data & the steps used to derive it
e.g. a DAG
– Annotation requires more storage; “Eager”
– Annotation can be as rich as user decides
• Inversion
– Store function (query) used to generate data and invert it
– Not all functions are invertible; auxiliary data required; JIT
computation; query optimization
– Minimal information provided (“Where”, “Why”)
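
The contrast between the two schemes can be sketched on a toy selection query. This is an illustration under stated assumptions, not a method from the talk: the station table, its values, and the names `hot`, `annotated_view`, `plain_view`, and `invert` are all made up.

```python
# Source table of (station, temperature in Celsius). Values are made up.
source = [
    ("KBMG", 31.0),
    ("KIND", 18.5),
    ("KOUN", 35.2),
]


def hot(row):
    """The query predicate: keep stations warmer than 30 C."""
    return row[1] > 30.0


# Annotation ("eager"): store the index of the source row next to each derived row.
annotated_view = [(row[0], idx) for idx, row in enumerate(source) if hot(row)]

# Inversion ("lazy"): store only the view and the query, and compute lineage on demand.
plain_view = [row[0] for row in source if hot(row)]


def invert(view_value):
    """Re-run the query against the source to find the rows behind `view_value`."""
    return [idx for idx, row in enumerate(source)
            if hot(row) and row[0] == view_value]


lineage_of_koun = invert("KOUN")
```

The annotated view pays for lineage at derivation time with extra storage per row. The inverted view stores nothing extra, but it needs the source table to still be available and must recompute whenever lineage is asked for.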
Syntactic vs. Semantic Representation of
Provenance
• Syntactic Structure
– XML for Annotations
– Implementation-specific for Inversion
• Semantic Knowledge
– Semantic language used to define lineage metadata
• RDF, OWL
– Advantages
• Provides Context
• Enhance searches
• Lineage proofs
– Ontologies used as a framework for semantic knowledge
– Community effort needed!
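
As a rough sketch of what semantic lineage metadata might look like, the statements below are written as subject-predicate-object triples, the data model underlying RDF. Plain Python tuples stand in for an RDF library, and the predicate names (`wasGeneratedBy`, `used`, `isA`) and resource names are hypothetical, not drawn from any ontology named in the talk.

```python
# Each statement is a (subject, predicate, object) triple.
# Predicate and resource names are hypothetical.
triples = [
    ("forecast_grid", "wasGeneratedBy", "forecast_run"),
    ("forecast_run", "used", "assimilated_obs"),
    ("assimilated_obs", "wasGeneratedBy", "assimilation"),
    ("assimilation", "used", "radar_scan"),
    ("forecast_run", "isA", "Process"),
    ("assimilation", "isA", "Process"),
]


def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Which resources are processes, and what did each of them use?
processes = [t[0] for t in match(p="isA", o="Process")]
used_by_process = {proc: [t[2] for t in match(s=proc, p="used")] for proc in processes}
```

Because the vocabulary is shared, the same pattern query works over lineage recorded by different tools, which is where a community-agreed ontology would come in.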
Provenance Storage
• Stored with or separate from data?
– Integrity, accessibility
• Maintenance
– Mutability, versioning
– Who is responsible: the data creator or a central authority?
• Scalability
– # of datasets, depth of lineage, granularity, geographical
distribution, # of users
– Inversion vs. Annotation; Distributed vs. Centralized
• Overhead
– Collection & storage
– Automation
Provenance Dissemination
• Browsing Provenance as a DAG
– Go back and forward in lineage through GUI
• Query based on lineage
– By source data, or generating process
– Enhanced by semantic information
– Drill down during data mining
• Verify how data was created by reenactment or by
presenting proof statements
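
A lineage query by source data or by generating process might look like the following sketch. The derivation records, the dataset and process names, and the functions `generated_by` and `derived_from` are all hypothetical.

```python
# Each record: (output dataset, generating process, input datasets). Names are made up.
derivations = [
    ("clean_scan", "quality_control", ["radar_scan"]),
    ("gridded_field", "gridding", ["clean_scan", "surface_obs"]),
    ("forecast", "model_run", ["gridded_field"]),
]


def generated_by(process):
    """Query by generating process: datasets produced directly by `process`."""
    return [out for out, proc, _ in derivations if proc == process]


def derived_from(source):
    """Query by source data: every dataset derived, directly or indirectly, from `source`."""
    found = []
    frontier = [source]
    while frontier:
        current = frontier.pop()
        for out, _, inputs in derivations:
            if current in inputs and out not in found:
                found.append(out)
                frontier.append(out)  # follow the lineage forward from this output
    return found


from_radar = derived_from("radar_scan")
```

The same records support both directions of browsing: following inputs goes back in lineage, and following outputs, as `derived_from` does, goes forward.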
Taxonomy in Brief
• Application of Provenance
– Data quality
– Replication Recipe
– Audit trail
– Informational
– Attribution
• Subject of Provenance
– Data vs. Process
– Granularity
• Representation of Provenance
– Annotation vs. Inversion
– Syntactic vs. Semantic
– Contents
• Provenance Storage
– Scalability
– Overhead
• Provenance Dissemination
Data Quality for Scientific Data
• Fitness for use
• Subjective & Objective Parameters
– believability, reputation, reliability
– precision, timeliness, accuracy
• Intrinsic Quality of data vs. Quality of data
service
– Correctness, consistency
– accessibility, throughput, availability
• Good quality for one application may not be
good for another (user driven)
Estimating Data Quality from Provenance
• Hypothesis: For derived datasets, quality
depends not just on the dataset but also on its
provenance — ancestral processes and data
• Quality of a dataset could be a function of:
– Attributes of dataset
– Attributes of generating process
– Ancestral Datasets used to derive this dataset
– And so on recursively …
Weighted DAG?
[Figure: the provenance DAG (nodes D0, D0’, D1, D2, D2’, D3, D4, P1, P2, P3) annotated with quality estimates]
D3_q = f(D3)
D4_q = f(D4)
P2_q = f(P2, D3_q)
P3_q = f(P3, D4_q)
D1_q = f(D1, P2_q)
D2_q = f(D2, P3_q)
P1_q = f(P1, D1_q, D2_q, D4_q)
D0_q = f(D0, P1_q)
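
The recursion in these equations can be sketched as follows. The slide leaves f unspecified, so the combining rule here (a node's own score multiplied by the weakest upstream quality) is purely an assumption, as are the scores themselves. The `inputs` table is read off the arguments of each equation.

```python
from functools import lru_cache

# Upstream nodes of each node, taken from the arguments of the equations above.
inputs = {
    "D3": [], "D4": [],
    "P2": ["D3"], "P3": ["D4"],
    "D1": ["P2"], "D2": ["P3"],
    "P1": ["D1", "D2", "D4"],
    "D0": ["P1"],
}

# Hypothetical intrinsic scores in [0, 1]; the values are made up.
own_score = {
    "D3": 0.9, "D4": 0.8,
    "P2": 1.0, "P3": 0.5,
    "D1": 1.0, "D2": 1.0,
    "P1": 1.0, "D0": 1.0,
}


@lru_cache(maxsize=None)
def quality(node):
    """Assumed f: own score times the weakest upstream quality (1.0 for source nodes)."""
    upstream = [quality(parent) for parent in inputs[node]]
    weakest = min(upstream) if upstream else 1.0
    return own_score[node] * weakest


d0_quality = quality("D0")
```

With these made-up scores the low-scoring process P3 caps the quality of everything downstream of it, including D0. A different choice of f, such as a weighted average, would let a process improve on its inputs.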
Challenges for Quality Metrics
• Some processes may produce better quality data
than their input datasets
• Subsetting, aggregation of data may change
overall quality estimate
• Quality of transformation may be parameter
dependent
• Multiple user profiles for different applications
• Missing lineage information can short-circuit
measurement
Uses of Data Quality Measurement
• Compare and rank datasets uniformly
– Google Personalized
• Reduce search space to datasets matching user
quality requirement
• Build community-wide quality feedback
mechanism
– Leverage knowledge of domain expert
– Promote publication of better quality data
– Amazon reviews?
Research Questions
• What are the metrics for estimating the quality
of data using provenance?
• How do we optimize user-centric searches
based on quality?
• How can we recover information from
incomplete lineage?
Thank you!
Questions | Comments