Collaboratory for Multi-scale Chemical Science (CMCS): A Knowledge Grid/ Adaptive Informatics Infrastructure

Jim Myers, Carmen Pancerella
Data Provenance and Annotation Dec. 2, 2003

CMCS – Enabling New Forms of Research and Communication
• Distributed Research Groups
• Chemical Databases
• Rich Publication
• Community Annotation
• Informatics Analysis
• Cross-scale Communication
• Peer Data Review
• Pedigree Analysis
• Automated informatics
• Automated monitoring/analysis

Adaptive Informatics Infrastructure
• Infrastructure – a well-designed, scalable, reusable, flexible set of tools, middleware, and services
• Informatics – the emerging use of semi-automated means to derive new knowledge from the analysis of (large amounts of) heterogeneous data, annotating existing data with its newly discovered meaning
• Adaptive – able to dynamically change to incorporate new knowledge and support new activities
› Low Barriers
  • Many access points
  • Storage of data in original formats with dynamic metadata extraction and translation
› Powerful
  • Arbitrary formats (binary, ASCII, XML)
  • Integrated data, metadata, pedigree across internal and external tools
› Evolvable
  • Schema can be changed/extended as needed
  • Metadata, translations, viewers, portal, etc. can be dynamically configured

SAM Architecture
[Architecture diagram. Labels: Notebook Services; Semantic Services; Metadata Services; DAV, DASL, JMS, SAM Extensions; DAV, JDBC, GridFTP; Web; Database; DataGrid]

SAM Metadata Services Layer
• Jakarta Slide DAV server plus configurable:
  › Mime Type Assignment
    • CMCS default: Based on dc:format tag within .xml file
  › Property Generation from binary/ASCII/XML files
    • 12 types → standard CMCS properties
  › Resource Translation
    • 12+ Viewers/Translators for CMCS including Interactive Applets
  › Mapping to Data Store(s)
    • NIST Kinetics DB
  › JMS Events for access and changes
    • Feeds events to CMCS NED Email Notification daemon
  › Authentication/Authorization model
    • (single sign-on with CMCS Portal – username/password or GridCert)
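
The CMCS default for MIME type assignment (read the dc:format tag inside an .xml file) could look roughly like the sketch below. This is an illustration only, not the SAM implementation: the function name, the fallback type, and the sample MIME value are all hypothetical, and the only fixed fact used is the Dublin Core element namespace.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element namespace
DEFAULT_MIME = "application/octet-stream"    # assumed fallback, not from the slides


def assign_mime_type(xml_text: str) -> str:
    """Return the text of the first non-empty dc:format element, else a fallback."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML: fall back rather than fail the upload.
        return DEFAULT_MIME
    # iter() walks the root and all descendants in document order.
    for elem in root.iter(f"{{{DC_NS}}}format"):
        if elem.text and elem.text.strip():
            return elem.text.strip()
    return DEFAULT_MIME


# Hypothetical document; the MIME value is made up for the example.
sample = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:format>chemical/x-hypothetical</dc:format>
</record>"""
mime = assign_mime_type(sample)
```

Keeping the rule in one small function matches the slide's point that the assignment is configurable: a different deployment could swap in another rule without touching the server.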

Extensible Scientific Interchange Language (XSIL) / Binary Format Description (BFD) language
• XSIL (Roy Williams, CalTech) - XML Encoding and Java code for scientific data
  › Ints, floats, vectors, arrays, time series, …
  › Can describe the byte structure of external data files/streams (encoding, byte order,…)
  › Can have link(s) to external data
• BFD (Alan Chappell, Jim Myers, PNNL) - XML Encoding and Java code for describing binary/ASCII files
  › Bug fixes, removed ambiguities
  › Parameterized logic (if, while, for…)
  › Parameterized Stream interface
• Being used as input for Grid Forum Data Format Description Language (DFDL) standard
<XSIL>
  <Param Name="date" Type="String" />
  <Param Name="Program Version" Type="float" />
  <Param Name="numColumns" Type="int" />
  <Array Name="data" Type="float">
    <Dim>
      <XBFDvalue-of select="/XSIL/Param[@Name='numColumns']" />
    </Dim>
    <Dim>6</Dim>
  </Array>
  <Stream Encoding="Binary" Type="Remote"
          XBFDstreamnumber="0" />
</XSIL>
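
The description above declares a float array named "data" whose first dimension comes from the numColumns parameter and whose second dimension is 6. A minimal sketch of how a reader might apply such a description to a binary stream follows. It is not the BFD Java code: the function name is hypothetical, and it assumes 32-bit IEEE floats, big-endian byte order, and that the first Dim varies slowest.

```python
import struct


def decode_float_array(payload: bytes, num_columns: int,
                       inner: int = 6, byte_order: str = ">"):
    """Decode num_columns x inner 32-bit floats from a raw byte payload.

    Assumptions (not stated in the slides): 4-byte IEEE floats,
    big-endian by default, first dimension varies slowest.
    """
    count = num_columns * inner
    expected = count * 4
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    flat = struct.unpack(f"{byte_order}{count}f", payload)
    # Slice the flat tuple into one list per first-dimension index.
    return [list(flat[i * inner:(i + 1) * inner]) for i in range(num_columns)]


# Round-trip demo with made-up data: 2 x 6 values packed big-endian.
values = [float(i) for i in range(12)]
payload = struct.pack(">12f", *values)
rows = decode_float_array(payload, num_columns=2)
```

The point of the description language is that the shape (numColumns by 6) and the encoding live in the XML, so a generic reader like this needs no format-specific code.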
Demo

Example
• Binary → XML → Properties
• Translation of Chemistry Data
• SAM-based Electronic Notebook
• CMCS Portal/Pedigree Browser
[Diagram labels: ELN; DAV+; Fortran Application; DAV; 'Local Disk'; JMS; DataGrid]

CMCS Provenance: de-facto standards
• Cmcs:hasinputs – workflow
• Cmcs:hasoutputs – workflow
• Sam:hastranslations – virtual workflow
• Cmcs:ispartofproject – hierarchy
• Eln:children – hierarchy
• (Dav:collection) – hierarchy
• Dcterms:references – scientific pedigree
• Dcterms:isreferencedby – scientific pedigree
• Eln:references – informal/private scientific pedigree
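
To make the idea of a pedigree concrete, the sketch below follows two of the listed properties (inputs and references) outward from a resource. Everything here is illustrative: the resource paths are invented, the properties are modeled as plain lists in a dictionary rather than fetched from a DAV server, and the lowercase prefix spelling is an assumption.

```python
from collections import deque

# Hypothetical resources and property values, for illustration only.
RESOURCES = {
    "/proj/input.dat": {},
    "/proj/run1.out": {"cmcs:hasinputs": ["/proj/input.dat"]},
    "/proj/table.xml": {
        "cmcs:hasinputs": ["/proj/run1.out"],
        "dcterms:references": ["/lit/paper42"],
    },
    "/lit/paper42": {},
}


def pedigree(start, resources,
             link_props=("cmcs:hasinputs", "dcterms:references")):
    """Breadth-first walk over provenance links; returns resources reached."""
    reached = []
    visited = {start}          # guards against cycles in the link graph
    queue = deque([start])
    while queue:
        current = queue.popleft()
        props = resources.get(current, {})
        for prop in link_props:
            for target in props.get(prop, []):
                if target not in visited:
                    visited.add(target)
                    reached.append(target)
                    queue.append(target)
    return reached


ancestors = pedigree("/proj/table.xml", RESOURCES)
```

Passing a different `link_props` tuple (for example only the hierarchy properties) gives a different view of the same data, which is roughly what a user-defined notion of provenance would amount to.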

Applications/Chemistry Services
• Extensible Computational Chemistry Environment
  › Export to CMCS with pedigree/metadata
• Active Thermochemical Tables
  › Portlet/web service using CMCS data store
• RIOT – adaptive mechanism reduction
  › Portlet/web service using CMCS data store – asynchronous invocation mechanism

Standard Protocol and API
• WebDAV: An early web service (XML commands over HTTP)
  › A widely adopted standard for metadata/data transport
  › Put/Get data with arbitrary properties (dynamic)
  › Properties can be discovered and accessed independently
  › DASL, Versioning, Transactions, …
• JSR 170: Java Content Repository
  › An API for working with nodes with properties (versioning, queries, typing, notification, …)
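
As an illustration of "XML commands over HTTP", the sketch below builds the XML body of a WebDAV PROPFIND request that asks for specific named properties. Only the "DAV:" namespace and the propfind/prop element names come from the WebDAV standard; the helper name and the second namespace URI are hypothetical, and no request is actually sent.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"  # namespace defined by the WebDAV standard


def build_propfind(properties):
    """Build a PROPFIND body requesting the given (namespace, name) pairs."""
    # Use the conventional "D" prefix for readability (process-wide setting).
    ET.register_namespace("D", DAV_NS)
    root = ET.Element(f"{{{DAV_NS}}}propfind")
    prop = ET.SubElement(root, f"{{{DAV_NS}}}prop")
    for namespace, name in properties:
        # Each requested property is an empty element in its own namespace.
        ET.SubElement(prop, f"{{{namespace}}}{name}")
    return ET.tostring(root, encoding="unicode")


body = build_propfind([
    ("DAV:", "getcontenttype"),                 # standard live property
    ("http://example.org/cmcs#", "hasinputs"),  # hypothetical namespace URI
])
```

A client would send this body with the PROPFIND method and a Depth header; because properties are requested by name, metadata can be read without downloading the resource itself.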

Path Forward
• Pilot groups doing “real” chemistry
• Exploring new practice
  › Peer-Review / Endorsement Mechanisms/Interfaces
    • Digital publication, third party annotation
  › Activity Reporting tools
  › Scoping Searches, Notifications
    • Based on user-defined notion of provenance/hierarchy
  › Notebook Views of Other Hierarchies
    • E.g. a notebook sharing a computational chemistry project hierarchy
  › Validation of Chemical networks
    • E.g. Active Thermo-chemical Tables
  › Workflow by Example…
  › Informatics Data File Assembly Tool

URLs/Team Members
• http://cmcs.org/
• http://www.scidac.org/SAM/
CMCS Team Members: Thomas C. Allison, Kaizar Amin, Sandra Bittner, Brett Didier, Michael Frenklach, William H. Green, Jr., Yen-Ling Ho, John Hewson, Wendy Koegler, Carina Lansing, David Leahy, Michael Lee, Renata McCoy, Michael Minkoff, James D. Myers, Sandeep Nijsure, Gregor von Laszewski, David Montoya, Carmen Pancerella, Reinhardt Pinzon, William Pitz, Larry Rahn, Branko Ruscic, Karen Schuchardt, Eric Stephan, Al Wagner, Baoshan Wang, Theresa Windus, Lili Xu, Christine Yang