Metadata Management and Cataloging Breakout
Jim Myers, Line Pouchard
Ann Chervenak, Richard Mount, Larry Rahn, Greg Riccardi, Sonja Tidemann, Steve Wiley

Wrong Title?
“I could do more science if I could: Automate workflow, Search my data faster. …”
“My paper is too short, I need more metadata…”

Drivers for Metadata
• Extracting more value from data
  – Data useful beyond grad student lifetime
  – Single-user efficiency
• Dealing with Moore’s Law, CS Advances
  – Managing more complex experiments: unique names are no longer sufficient
  – Componentization of Codes
  – (Decomposition of concerns (aspects))

Drivers for Metadata (continued)
• Changing Science: moving beyond an oral tradition
  – Need to share context-dependent data across community(ies) (data dissemination/discovery)
  – Support mapping between data models (across domains, over time)
  – Managing non-hierarchical data relationships / multiple hierarchies at once
  – Describing hypotheses / statements of trust / reification (statements about other statements)

Catch 22
• Everybody says metadata is important, but few actually record it
  – Frog in the pot
  – Tragedy of the Commons
  – Paradigm shift
• What’s changing?
  – New Science drivers require it
  – New technologies will simplify capture and management

Uses
– Provenance (original conditions, subsequent workflow, workflow by example)
  • Reproducing experiments and analysis
  • Virtual Data
  • Workflow-by-example
– Data Discovery
  • Metadata-based search (features, subsets, …)
– Data Quality
  • Evaluation/review
  • Endorsement
  • Curation/records information
– Annotation
  • Data context
  • Relation to other data
– Discovery/Mining/Inference/Monitoring
– Not discussed much: metadata applies not only to data but also to services, programs, machines, instruments

R&D Challenges
– What to standardize, what to record?
  • Infrastructure is general; some schemas should be (workflow, experiment mgmt), but most are domain specific
– Metadata Services
  • Scalable, distributed, schema-independent
  • Semantic federation/ontology mapping, derived indexes/info retrieval service, global ids, rich authorization models, data granularity, inference services, curation (tuning based on access, etc.)
  • Usability: metadata input/capture (automation, cultural change, rewards, Google precedent)
  • Maintainability: automated quality management

From workshop 1
Provenance: Conceptual Services
• Logical Naming
• Lifecycle
• Discovery (data, schema)
• Basic Management (ingest, storage, query, update, notification, ...)
• Reasoning (mapping, inference, …)
• Records (signing, nonrepudiation)
• Migration (schema, formats, signatures, ...)
• Archival/versioning (copies of external data, services, …)
• Policy enforcement (fit for purpose, adheres to common data model, …)
• Federation & aggregation
• Collection and/or Compounding
• Curation (e.g. conflict detection & resolution)
• Workflow “Proxy”

Program Scope
– Research: metadata services (see above)
– Pilot: use of rich metadata to support grand challenge projects
– Develop/deploy: general metadata capture tools (capture from workflows, problem solving environments)
– Maintain: metadata management as cyberinfrastructure (requires research on scaling, maintainability, …)
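
Illustration (not from the breakout): several of the services listed above, namely global ids, logical naming, ingest, metadata-based search, and provenance, can be pictured with a very small in-memory catalog. The sketch below is only a toy under stated assumptions: the names `Record`, `Catalog`, `ingest`, `search`, and `lineage`, and the example attributes such as `stage` and `instrument`, are hypothetical and do not correspond to any system discussed in the slides. It keeps attributes as free-form key/value pairs to mimic a schema-independent store.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
import uuid


@dataclass
class Record:
    """Hypothetical catalog entry: logical name, free-form attributes, provenance links."""
    logical_name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    derived_from: List[str] = field(default_factory=list)  # global ids of input records
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # global id


class Catalog:
    """Toy in-memory metadata catalog (an assumption, not a real service)."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def ingest(self, record: Record) -> str:
        # Refuse provenance links that point at records the catalog has never seen.
        for parent_id in record.derived_from:
            if parent_id not in self._records:
                raise KeyError(f"unknown parent record: {parent_id}")
        self._records[record.record_id] = record
        return record.record_id

    def search(self, **criteria: Any) -> List[Record]:
        # Metadata-based search: keep records whose attributes match every criterion.
        return [
            r for r in self._records.values()
            if all(r.attributes.get(k) == v for k, v in criteria.items())
        ]

    def lineage(self, record_id: str) -> List[str]:
        # Breadth-first walk up the derived_from links; nearest ancestors first.
        seen = set()
        names: List[str] = []
        queue = list(self._records[record_id].derived_from)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.add(current)
            names.append(self._records[current].logical_name)
            queue.extend(self._records[current].derived_from)
        return names


# Usage with made-up data: a raw dataset, a calibrated product, and a figure.
catalog = Catalog()
raw = catalog.ingest(Record("run42/raw", {"stage": "raw", "instrument": "detector-A"}))
cal = catalog.ingest(Record("run42/calibrated", {"stage": "calibrated"}, derived_from=[raw]))
fig = catalog.ingest(Record("run42/figure", {"stage": "plot"}, derived_from=[cal]))

print(catalog.lineage(fig))
print([r.logical_name for r in catalog.search(stage="raw")])
```

A real metadata service would add the harder items from the R&D Challenges slide (distribution, authorization, ontology mapping, curation); the sketch only shows why recording `derived_from` links at ingest time makes later provenance and discovery queries cheap.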