Metadata Semantics and the Earth System Curator
Rocky Dunlap
Earth System Curator
Georgia Tech

Earth System Curator
  3-year NSF-funded project
  Funded Collaborators:
    • Cecelia DeLuca (NCAR, PI)
    • Balaji (GFDL, Co-PI)
    • Don Middleton (NCAR, Co-PI)
    • Chris Hill (MIT, Co-PI)
    • Spencer Rugaber (Ga Tech, Co-PI)
    • Leo Mark (Ga Tech)
    • Julien Chastang (NCAR)
    • Sergey Nikonov (GFDL)
    • Angela Navarro (Ga Tech)
    • Me (Ga Tech)
  Also working with:
    • Lois and Katherine (NMM)
    • Sophie Valcke (PRISM/OASIS)
    • Others...

Curator Doctrine
  Currently a gap in the way we treat models and datasets (are they really so different?)
    • Best description of a dataset is a comprehensive description of the model run that created the dataset (+ post processing)
    • Model components are data objects for exchange
  Metadata-centric view
    • Don’t start with a dataset and try to find the metadata...
    • Start with good metadata that leads you to the datasets you want, even if they don’t yet exist! (No, really, that’s how we think.)
  Haiku are a valid form of model metadata

Earth System Curator Applications (Proofs of Concept)
  Catalog of modeling components along with comprehensive metadata
  Demonstrate compatibility checking of components
    • CDP Curator (Michael B., Don, Luca, Julien)
    • Primarily “technical” compatibility: platforms, compilers, required fields, field data types, calendar/time
  Demonstrate auto-generation of coupler component based on metadata
  Demonstrate automation of workflow tasks
    • Model assembly, execution, archive, postprocessing

Schema Development Fun
  To accomplish these goals, we need:
  Comprehensive descriptions of climate models: model metadata
  Includes both “semantic” and “syntactic” elements (“discovery” vs. “use”)
    • Semantic: component name, type, owner, description, source code location, component architecture of model, platform, framework
    • Syntactic: parameter settings, input datasets, boundary conditions, coupling details, grid coordinates

Lots of schemata...
  Component (NMM)
  Potential Model (NMM/Curator)
  Model (NMM)
  PMIOD/SMIOC (PRISM coupling spec)
  CRE/Curator Complete (workflow)
  Application (NMM)
  Gridspec

Reminiscing on Metadata Development
  Observations:
  (It seems) much of the community is in support of metadata development
    • Although there are different opinions on levels of comprehensiveness
  People are using metadata for different reasons:
    • Annotate large datasets for retrieval
    • Inform analysis tools
    • Archiving of modeling components
    • Automation of workflow (runtime environ.)
    • Exchange datasets
  Each application requires different (but often overlapping) metadata

How should we think about schemata?
  Schemata are typically written for applications:
    • I have a particular task I want to accomplish
    • What metadata do I need to accomplish it?
    • Write a schema.
  But... Now we have lots of schemata sitting around
    • They may contain overlapping information
    • Different ways of expressing the same information
    • Each schema is used for a small number of tasks and understood by a small number of applications
    • May need to reference elements in another schema, or aggregate elements from multiple schemata

A Unified View of Metadata
  Given all of the current metadata development efforts, Curator is promoting a unified view of metadata
  Metadata reuse must be a priority
  Metadata aggregation is key: schemata built (generated!) from repository of existing metadata elements (let’s call them types)
  We must think conceptually first and then syntactically; ideally, all groups will agree at both levels

What’s In a Schema?
  [Diagram: an XML Schema (e.g., gridspec.xsd) containing the XML types GridTile, ContactRegion, GridDescriptor, and Boundary]
  These are syntactic and conceptual constructs

Re-using schema elements
  How do I best use/re-use metadata elements from (multiple) schema(ta) to accomplish my particular application?
  You need:
    • A conceptual understanding of the “types” (concepts) in the schema → Glossary
    • The syntactic representation of that type (so you can actually use it in implementations) → XML Type Library
  WE ARE HERE

Multi-Schema Semantic Glossary
  Community-wide glossary of metadata types/concepts from multiple schemata
  Concepts aggregated into a centralized glossary
  Schema authors and users can get explanations/definitions of metadata elements.
  Examples:
    • What does the contact_region tag mean in the Gridspec schema?
    • What goes under the intent tag in the PMIOD?
    • What is a potential model anyway?

Multi-Schema Semantic Glossary
  For each metadata concept provide:
    • Human-readable definition
    • Source schema
    • Example usage
    • Change notes/provenance
    • Semantic relationships with other concepts (e.g., broader than, narrower than, part of, parent of, synonym)

Glossary Design
  Schema authors embed descriptions directly inside each XML schema
    • Keep the human-readable definitions close to the formal syntactic definitions
    • When schema is updated, it is easy to update glossary
  Glossary entries from distributed schemata are harvested (nightly?) and placed into centralized glossary (alternatively, live access?)
  Simple interface allows users to query glossary for concepts

Glossary Design
  Simple Knowledge Organization System (SKOS) data model for glossary entries
    • http://www.w3.org/2004/02/skos/
  SKOS supports knowledge organization systems like glossaries, thesauri, taxonomies, etc.
  RDF based: move the community toward languages with higher semantics (eventually get down to dataset level)

Sample SKOS RDF (Basic)
  <skos:Concept rdf:about="http://.../schema/1.0#PotentialModel">
    <skos:prefLabel>potential model</skos:prefLabel>
    <skos:definition>
      A set of components at the source code level that can potentially form an executable model. ...
    </skos:definition>
  </skos:Concept>

  Where should glossary entries be stored?

Example Annotated Schema
  ...
  <xsd:complexType name="PotentialModel">
    <xsd:annotation>
      <xsd:documentation>
        <skos:Concept rdf:about="http://.../schema/1.0#PotentialModel">
          <skos:prefLabel>potential model</skos:prefLabel>
          <skos:definition>
            A set of components at the source code level that can potentially form an executable model.
          </skos:definition>
        </skos:Concept>
      </xsd:documentation>
    </xsd:annotation>
    <!-- rest of complexType definition goes here -->
  </xsd:complexType>
  ...

Sample SKOS RDF Triples
  Subject             Predicate        Object
  esc:PotentialModel  rdf:type         skos:Concept
  esc:PotentialModel  skos:prefLabel   ‘potential model’
  esc:PotentialModel  skos:definition  ‘A set of components at the source code level that can potentially form an executable model.’

Other SKOS Fields
  <skos:Concept rdf:about="http://purl.oclc.org/NMM/Model/011/#model">
    <skos:prefLabel>model</skos:prefLabel>
    <skos:definition>
      The root element of a NMM Model description. There is one model per xml file.
      This model can have one or more related component configurations.
    </skos:definition>
    <skos:altLabel>simulation</skos:altLabel>
    <skos:altLabel>job</skos:altLabel>
    <skos:altLabel>run</skos:altLabel>
    <skos:example>UK Met Office Unified Model</skos:example>
    <skos:related rdf:resource="http://...NMMPotentialModel/1.0/#PotentialModel"/>
    <skos:changeNote rdf:parseType="Resource">
      <rdf:value>The label 'model' was changed from NMM_Model.</rdf:value>
      <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">
        <foaf:Person xmlns:foaf="http://xmlns.com/foaf/0.1/">
          <foaf:name>Katherine Bouton</foaf:name>
          <foaf:mbox rdf:resource="mailto:..."/>
        </foaf:Person>
      </dc:creator>
      <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-02-02</dc:date>
    </skos:changeNote>
    <dc:source rdf:resource="http://purl.oclc.org/NMM/Model"/>
  </skos:Concept>

Semantic Relationships
  [Diagram: a graph over the concepts nmm:Model, esc:PotentialModel, nmm:Component, and prism:Model. Edge labels in the figure: skos:related (nmm:Model to esc:PotentialModel), skosx:childOf (twice), skos:synonym (twice).]

Putting it all Together
  [Diagram: five numbered steps. Labels as they appear in the figure:]
    • Namespace Schemata (e.g., NMM, Curator-NMM, Gridspec, ESG), marked up with glossary metadata (terms, definitions, relationships)
    • Glossary metadata harvested nightly
    • Aggregate Glossary RDF (Joseki RDF Server)
    • Glossary Web Application (Tomcat, www.earthsystemcurator.org/glossary), SPARQL Queries
    • Client Web Browser: search for terms, view relationships, etc.
  More info:
    • http://glossary.earthsystemcurator.org/
    • http://www.earthsystemcurator.org/index.php?option=com_content&task=view&id=54&Itemid=84

Glossary Interface
  [Screenshot callouts: Search; Concept List; Schemata to Include; Links to related concepts; Concept Details]

Syntactic Metadata Re-use
  So, if we agree on the concepts, what about the syntax? (i.e., XML representation)
    • Concept = XML Type
  How do we share XML types from multiple schemata across the community?
  One idea: XML Type Library (or Catalog or Repository)
    • “Preliminary Research”
    • This is NOT the same thing as a single complex schema that describes everything: types are first-class objects and can be manipulated individually

How does an XML Type Library work?
  Operations (web service?)
    • Submit an XML type
    • Get a list of all types
    • Query for types
    • Validate a type (Is my XML fragment a valid X?)
    • Type membership (What types does my XML fragment fit?)
    • Generate an XML Schema

How does an XML Type Library work?
  What metadata is available per type?
    • Definition (e.g., XML Schema complexType)
    • SKOS Glossary entry (for queries)
    • Example usage scenarios
    • Dependencies on other types
    • Versioning metadata
    • Available operations/web services: “If you have an XML fragment of type X, you can use the following services...”

Use Case: Submit Type
  Existing Schemata → Extract Types → Submit to Type Library
  [Diagram: a stack of extracted types, each an annotated complexType such as:]
  <xsd:complexType name="PotentialModel">
    <xsd:annotation>
      <xsd:documentation>
        <skos:Concept rdf:about="http://.../schema/1.0#PotentialModel">
          <skos:prefLabel>potential model</skos:prefLabel>
          <skos:definition>A set of components at the source code...
          </skos:definition>
        </skos:Concept>
      </xsd:documentation>
    </xsd:annotation>
    <!-- rest of complexType definition goes here -->
  </xsd:complexType>

Use Case: Validation
  XML Fragment → Type Library (Validate) → “Valid” or “Invalid”
  <horizontal_coord_system type="cartesian">
    <x_axis>...</x_axis>
    <y_axis>...</y_axis>
  </horizontal_coord_system>

Use Case: Find Services
  XML Fragment → Type Library (Find Services) → list of available services based on type of fragment
  <horizontal_coord_system type="cartesian">
    <x_axis>...</x_axis>
    <y_axis>...</y_axis>
  </horizontal_coord_system>
  Available services:
    • Interpolate_Service()
    • Extract_Variable()
    • Massage_Data()
    • Another_Operation()

Some Conclusions
  With a large amount of metadata activity already in progress, metadata re-use must be a priority
  Conceptual understanding is essential
    • Adoption of a glossary of concepts
  Syntactic agreement is desirable
    • Concepts assigned concrete XML types and stored in a library

Some Haiku
  Retile the Shower
  Tessellated Mosaic
  First Write a Gridspec

  Forever summer
  questions and answers
  Curator complete

  Potential Model
  Like a cool autumn breeze
  Potentially mad

Extra Slides...

Example Gridspec Applications
  Not written for one particular application: general grid metadata has many potential uses
    • IPCC Model Documentation table
    • Moving variables to common grid for analysis
    • Regridding vertical from 24 to 40 levels
  There are two levels: conceptual and syntactic. Ideally, we would agree at both of these levels!
  If we only have conceptual agreement, we can still interoperate, but must do transformations

Type Reuse Scenario
  [Diagram: a Full Schema (Gridspec.xsd) is cut down to Partial Schemata for an application]
  Application: NARCCAP Vertical Interpolation
  Partial Schema: description of vertical coordinate scheme
  Metadata required for NARCCAP experiment: interpolate from 24 to 40 vertical levels

Schema Aggregation Scenario
  [Diagram: XML types from Schema A, Schema B, Schema C, and Schema D are aggregated into an Application Schema]
  Application: Component Compatibility Checking
  Source schemata: NMM Component, Coupling Spec (PMIOD), Gridspec
  Application Schema:
    • Required coupling fields
    • Technical details (e.g., supported platforms)
    • Horizontal grid descriptor
  All metadata required for compatibility checking of two components
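
Sketch: Compatibility Checking from Aggregated Metadata
  A minimal sketch of how the aggregated application schema above might drive a compatibility check between two components. Everything here is an assumption made for illustration: the ComponentMetadata class, its field names, and the sample platform, calendar, and field values are hypothetical and are not part of NMM, PMIOD, Gridspec, or any Curator tool. It covers only a few of the “technical” checks named earlier (platforms, required fields, field data types, calendar).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ComponentMetadata:
    """Hypothetical aggregate of the metadata a compatibility check needs."""
    name: str
    platforms: set[str]          # technical details: supported platforms
    calendar: str                # calendar/time convention
    exports: dict[str, str] = field(default_factory=dict)   # field name -> data type
    requires: dict[str, str] = field(default_factory=dict)  # field name -> data type


def check_compatibility(provider: ComponentMetadata,
                        consumer: ComponentMetadata) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []

    # The two components must share at least one supported platform.
    if not provider.platforms & consumer.platforms:
        problems.append("no common platform")

    # Both components must use the same calendar.
    if provider.calendar != consumer.calendar:
        problems.append(
            f"calendar mismatch: {provider.calendar} vs {consumer.calendar}")

    # Every field the consumer requires must be exported with the same type.
    for name, dtype in sorted(consumer.requires.items()):
        if name not in provider.exports:
            problems.append(f"missing required field: {name}")
        elif provider.exports[name] != dtype:
            problems.append(
                f"type mismatch for {name}: {provider.exports[name]} vs {dtype}")

    return problems


# Hypothetical sample components (values invented for the example).
atmosphere = ComponentMetadata(
    name="atmosphere",
    platforms={"linux-x86_64", "ibm-power"},
    calendar="gregorian",
    exports={"surface_temperature": "float64", "wind_stress": "float64"},
)
ocean = ComponentMetadata(
    name="ocean",
    platforms={"linux-x86_64"},
    calendar="noleap",
    requires={"wind_stress": "float64", "precipitation": "float32"},
)

problems = check_compatibility(atmosphere, ocean)
for problem in problems:
    print(problem)
```

  The check returns a list of problems rather than a single yes/no so that a catalog or workflow tool could report every mismatch at once. A real checker would also need the compiler and grid comparisons, which this sketch leaves out.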