Data Management and
Representations in Ecce and CMCS
Theresa L. Windus
Pacific Northwest National Laboratory
Environmental Molecular Sciences Laboratory
Molecular Science Software Group
Outline
Some “definitions”
Data and task representations


Ecce
CMCS
Summary
Acknowledgement
Data and metadata
(one scientist’s data is another scientist’s metadata)
$\Delta H^{\circ}_{\mathrm{atomiz},0}(\mathrm{CH_3OOH}) = 522.09 \pm 2.02$ kcal/mol
[calculated, G3//B3LYP, T. Windus, more at http://...]
data: value and uncertainty
units: kcal/mol
quantity: enthalpy of atomization
species: methylhydroperoxide, CAS# 3031-73-0
temperature: 0 K
calculated: G3//B3LYP
creator: T. Windus using Ecce
more info: http://avatar.emsl.pnl.gov:8080/Ecce/.../CH3OOH/.../GxEnergy
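The breakdown above can be sketched as a small record that keeps the value and its metadata together. This is only an illustration: the class name `AnnotatedValue` and the metadata key names are assumptions, not the Ecce or CMCS data model.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedValue:
    """A numeric result plus the metadata that gives it meaning (hypothetical type)."""
    value: float
    uncertainty: float
    units: str
    metadata: dict = field(default_factory=dict)

    def describe(self) -> str:
        # Fall back to placeholders so an undocumented value is visibly undocumented.
        quantity = self.metadata.get("quantity", "unknown quantity")
        species = self.metadata.get("species", "unknown species")
        return (f"{quantity} of {species}: "
                f"{self.value:.2f} ± {self.uncertainty:.2f} {self.units}")


# Values taken from the slide; the dictionary keys are illustrative.
atomization = AnnotatedValue(
    value=522.09,
    uncertainty=2.02,
    units="kcal/mol",
    metadata={
        "quantity": "enthalpy of atomization",
        "species": "methylhydroperoxide",
        "cas": "3031-73-0",
        "temperature": "0 K",
        "calculated": "G3//B3LYP",
        "creator": "T. Windus using Ecce",
    },
)
print(atomization.describe())
```

Without the metadata dictionary the same object would carry only two numbers and a unit string, which is the point the slide makes.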
Metadata Converts Scientific Data into
Knowledge
Metadata provides identification and documentation to scientific data.
  Example: Attaching an owner, creation date, abstract, type to data.
  Example: Tracking data to program versions, and possibly bugs for that version.
Metadata documents the context and value of the data.
  Example: The theoretical atomization energy of methylhydroperoxide (and its uncertainty) from Ecce (used as input to ATcT) contains information identifying the species and the quantity, units, the theoretical method used, vibrational frequencies and geometry, reference to source file, creator, etc.
Metadata facilitates cross-scale transfer of data.
  Example: Can show a chain of inputs, including input parameters and configuration files, across scales.
  Example: Can retrieve literature references which describe this data.
Metadata allows users to comment on the data and its quality.
  Example: Can be used for scientific peer review of data.
Metadata is necessary for effective collaboration.
  Example: Scientific data becomes more usable to others when it is documented.
Annotation is another term for metadata. Annotations can be added by either the data owner or a third party.
Data Pedigree: A Special Kind of Metadata
Data pedigree or data provenance is a relationship which provides a “line of ancestors”.
Pedigree allows for the categorization and tracing of the scientific data, and for the identification of the data’s ultimate origin, possibly across scales.
Pedigree includes the series of steps necessary to reproduce the data.
Data is linked, for example, to projects, references, inputs, and outputs.
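The "line of ancestors" idea can be sketched as a walk over input links. The graph below and the function name `trace_ancestors` are hypothetical; they only illustrate how following input relationships leads back to a result's ultimate origin.

```python
def trace_ancestors(item, has_input):
    """Return every ancestor of `item`, nearest first, by following input links."""
    ancestors = []
    queue = list(has_input.get(item, []))
    while queue:
        parent = queue.pop(0)
        if parent in ancestors:
            continue  # already visited; also guards against cycles
        ancestors.append(parent)
        queue.extend(has_input.get(parent, []))
    return ancestors


# Hypothetical pedigree: each key lists the items it was derived from.
pedigree = {
    "enthalpy": ["output file"],
    "output file": ["input file"],
    "input file": ["geometry", "basis set"],
}
print(trace_ancestors("enthalpy", pedigree))
```

Items with no entry in the mapping are treated as origins, so the walk stops there.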
Knowledge Grid
A set of scalable tools, middleware, and services
For the creation, analysis, dissemination,
evaluation, and use
Of data, information, and knowledge
By individuals, groups, and communities
…A digital place for performing ‘all’ aspects of
science
Ecce & NWChem
Ecce – Extensible Computational Chemistry Environment
  comprehensive problem solving environment
  common graphical user interfaces
  scientific modeling management
  seamless transfer of information between applications
  persistent data storage through DAV
  integrated scientific data management
  tools for ensuring efficient use of computing resources across a distributed network
  visualization of multi-dimensional data structures
  http://ecce.emsl.pnl.gov
NWChem – massively parallel computational chemistry program
  Energetics, geometries, frequencies, etc. at various levels of theory
  http://www.emsl.pnl.gov/docs/nwchem
Ecce is… (cont.)
Ecce Architecture
Distributed Authoring and Versioning (DAV)
An early web service (XML commands over HTTP)
A widely adopted standard for metadata/data transport
Put/Get data with arbitrary properties (dynamic)
Properties can be discovered and accessed independently
DASL, Versioning, Transactions, …
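As a rough illustration of "XML commands over HTTP", the sketch below builds the XML body a WebDAV client would send in a PROPPATCH request to set arbitrary properties on a resource. The helper name `build_proppatch` is an assumption, and the property namespace is the one that appears in the Ecce metadata examples later in this deck; nothing here is Ecce's actual client code.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
ECCE_NS = "http://www.emsl.pnl.gov/ecce"  # namespace seen in the Ecce property examples


def build_proppatch(properties, namespace=ECCE_NS):
    """Build a WebDAV propertyupdate body that sets each name/value pair."""
    ET.register_namespace("D", DAV_NS)
    update = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_element = ET.SubElement(update, f"{{{DAV_NS}}}set")
    prop = ET.SubElement(set_element, f"{{{DAV_NS}}}prop")
    for name, value in properties.items():
        # Each property is an element in the caller's own namespace.
        child = ET.SubElement(prop, f"{{{namespace}}}{name}")
        child.text = str(value)
    return ET.tostring(update, encoding="unicode")


body = build_proppatch({"owner": "d39974", "application": "NWChem"})
print(body)
```

Because properties are plain namespaced elements, a client can attach new ones without any schema change on the server, which is what "dynamic" refers to above.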
What does the WebDAV protocol provide?
[Diagram: Applications reach the DAV Server over HTTP/WebDAV. The server exposes Collections and Resources, each with Properties, on top of a Data Storage Provider.]
Accessing WebDAV Server from Windows 2000
Accessing WebDAV Server Using Browser
Accessing WebDAV Server Using Ecce
[Screenshot labels: Calculation, BasisSet, Files, Chemical System, Properties]
Ecce Physical Model
Calculations are referred to as a “virtual document” because we distribute the structure across many physical objects.
Physical collections and resources are URI addressable.
Collections are unordered and allow mixed content.
[Diagram: a Project contains Projects and Calculations. A Calculation is composed of Basis Set, Chemical System, Properties, and Setup Data/Logs. Physical layout labels: Project, Calculation, BasisSet, Files, Chemical System, Properties.]
Calculation Setup
[Diagram labels: Builder, Basis Set Tool, Calculation Editor, Geometry, Basis Set, ESP, Theory Details, Runtype Details, .edml File, Python, Template File, Parameters, ai.input, Perl, Basis Set Reformatting Script, Perl, Input Deck]
Output Parsing
[Diagram labels: Output, Job Monitor, Parse Descriptor, Perl, Text Block 1 with Parse Script 1, Text Block 2 with Parse Script 2, ..., Text Block N with Parse Script N, Ecce DataBase, Calculation Viewer]
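The pairing of text blocks with parse scripts can be sketched as a descriptor table that maps a pattern to a parser. Everything below is hypothetical: the output lines, the property key `TE`, and the parser functions are invented for illustration, and Python functions stand in for the Perl parse scripts named on the slide.

```python
import re


def parse_energy(block: str) -> float:
    # "Total energy = -40.198" -> -40.198
    return float(block.split("=")[1])


def parse_cpu_seconds(block: str) -> float:
    # "CPU time: 0.96s" -> 0.96
    return float(block.split(":")[1].rstrip("s \n"))


# Hypothetical parse descriptor: property name, pattern that locates the
# text block in the output, and the function that parses that block.
PARSE_DESCRIPTOR = [
    ("TE", re.compile(r"^Total energy =.*$", re.MULTILINE), parse_energy),
    ("CPUSEC", re.compile(r"^CPU time:.*$", re.MULTILINE), parse_cpu_seconds),
]


def parse_output(text: str) -> dict:
    """Run every registered parser on the text block its pattern finds."""
    results = {}
    for name, pattern, parser in PARSE_DESCRIPTOR:
        match = pattern.search(text)
        if match:
            results[name] = parser(match.group(0))
    return results


sample_output = """\
Iteration 12 converged
Total energy = -40.198
CPU time: 0.96s
"""
print(parse_output(sample_output))
```

Supporting a new property then means adding one descriptor entry and one parser, without touching the dispatch loop.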
Example metadata
On the calculation:
http://www.emsl.pnl.gov/ecce:contenttype=ecceCalculation
http://www.emsl.pnl.gov/ecce:resourcetype=VIRTUAL_DOCUMENT
http://www.emsl.pnl.gov/ecce:createdWith=v3.2
http://www.emsl.pnl.gov/ecce:owner=d39974
http://www.emsl.pnl.gov/ecce:application=NWChem
http://www.emsl.pnl.gov/ecce:theory=SCF/RHF
http://www.emsl.pnl.gov/ecce:spinmultiplicity=Singlet
http://www.emsl.pnl.gov/ecce:currentVersion=v3.2
http://www.emsl.pnl.gov/ecce:creationdate=Mon, 22 Mar 2004 17:24:00 GMT
http://www.emsl.pnl.gov/ecce:reviewed=false
http://www.emsl.pnl.gov/ecce:runtype=ESP
http://www.emsl.pnl.gov/ecce:launch_machine=arunta
http://www.emsl.pnl.gov/ecce:launch_nodes=1
http://www.emsl.pnl.gov/ecce:launch_rundir=/home/d39974/ecceruns
http://www.emsl.pnl.gov/ecce:launch_totalprocs=1
http://www.emsl.pnl.gov/ecce:launch_user=d39974
http://www.emsl.pnl.gov/ecce:launch_maxmemory=0
http://www.emsl.pnl.gov/ecce:launch_remoteShell=ssh
http://www.emsl.pnl.gov/ecce:job_jobid=13858
http://www.emsl.pnl.gov/ecce:job_path=/home/d39974/ecceruns/tracebug/esp
http://www.emsl.pnl.gov/ecce:job_clienthost=arunta
http://www.emsl.pnl.gov/ecce:startdate=Mon, 22 Mar 2004 17:25:11 GMT
http://www.emsl.pnl.gov/ecce:version=Thu May 8 13:16:51 PDT 2003 Version 4.5
http://www.emsl.pnl.gov/ecce:state=Complete
http://www.emsl.pnl.gov/ecce:completiondate=Mon, 22 Mar 2004 17:25:14 GMT
DAV:resourcetype=<D:collection/>
DAV:creationdate=2004-03-22T17:24:38Z
DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
DAV:getetag="b2805d-1000-926a8180"
DAV:supportedlock=
DAV:getcontenttype=httpd/unix-directory

On the molecule:
http://www.emsl.pnl.gov/ecce:empiricalFormula=H4C
http://www.emsl.pnl.gov/ecce:charge=0.000000
http://www.emsl.pnl.gov/ecce:useSymmetry=false
http://www.emsl.pnl.gov/ecce:symmetrygroup=C1
DAV:creationdate=2004-03-22T17:24:38Z
DAV:getcontentlength=386
DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
DAV:getetag="b28064-182-926a8180"
DAV:executable=F
DAV:supportedlock=
DAV:getcontenttype=chemical/x-ecce-mvm
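Each property line above has the shape `namespace:name=value`, where the namespace may itself contain colons. A minimal sketch of splitting such a line follows; the function name `split_property` is hypothetical, and the listing format is taken as displayed here rather than as a defined Ecce file format.

```python
def split_property(line):
    """Split 'namespace:name=value' into a (namespace, name, value) tuple."""
    # Split on the first '=' so values containing ':' or '=' stay intact.
    qualified_name, _, value = line.partition("=")
    # The property name follows the last ':' of the qualified name.
    namespace, _, name = qualified_name.rpartition(":")
    return namespace, name, value


lines = [
    "http://www.emsl.pnl.gov/ecce:owner=d39974",
    "DAV:creationdate=2004-03-22T17:24:38Z",
    "DAV:supportedlock=",
]
for line in lines:
    print(split_property(line))
```

Splitting on the first `=` before looking for the last `:` matters because both the namespace URL and timestamp values contain colons.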
Example MVM file
title: demo
type: molecule
num_atoms: 1065
atom_info: symbol cart
atom_list:
O -2.37400 -3.09100 13.5210
H -1.91600 -2.20200 14.0480
...
pdb_list:
H O5* RC
1 157D A
H H5T RC
1 157D A
…
attr_list:
-0.622300 1 1 0 0
0.429500 1 1 0 0
…
atom_type_list:
OH
HO
…
num_bonds: 1028
bond_list:
2 1 1.00000
1 3 1.00000
…
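A reader for this layout can be sketched by treating every `name:` line as the start of a section and the lines that follow as its rows. This is inferred from the sample alone, not from a format specification, and the function name `parse_mvm` is hypothetical.

```python
import re

FIELD = re.compile(r"^(\w+):\s*(.*)$")


def parse_mvm(text):
    """Group an MVM-style listing into sections keyed by field name."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line in ("...", "…"):
            continue  # skip blanks and elision markers
        match = FIELD.match(line)
        if match:
            current = match.group(1)
            sections[current] = {"value": match.group(2), "rows": []}
        elif current is not None:
            sections[current]["rows"].append(line.split())
    return sections


# Shortened two-atom sample in the same layout as the slide.
sample = """\
title: demo
type: molecule
num_atoms: 2
atom_info: symbol cart
atom_list:
O -2.37400 -3.09100 13.5210
H -1.91600 -2.20200 14.0480
"""
molecule = parse_mvm(sample)
print(molecule["title"]["value"], len(molecule["atom_list"]["rows"]))
```

Row values are left as strings so that each list section (coordinates, bonds, atom types) can apply its own conversion.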
XML format for Properties
<?xml version="1.0" encoding="utf-8" ?>
<value name="CPUSEC" units="second">9.60000000000000e-01</value>
<?xml version="1.0" encoding="utf-8" ?>
<vector name="MLKNSHELL" rows="7" units="e" rowLabel="Unknown" rowLabels="1 2 3 4 5 6
7">1.99199825923126e+00 1.18803456337004e+00 3.08260463820159e+00 9.34340637068915e-01
9.34340635555820e-01 9.34340634042729e-01 9.34340632529639e-01</vector>
<?xml version="1.0" encoding="utf-8" ?>
<tsvectable name="GEOMTRACE" rows="5" units="Angstrom" columns="3" vectors="1" rowLabel="Atom,Coordinate"
rowLabels="0 1 2 3 4" columnLabel="Coordinate" vectorLabel="Coordinate" columnLabels="X Y Z"><step
number="1">0.000000000000000e+00 0.000000000000000e+00 0.000000000000000e+00 -6.755000000000000e-01
-6.755000000000000e-01 6.755000000000000e-01 6.755000000000000e-01 6.755000000000000e-01
6.755000000000000e-01 6.755000000000000e-01 -6.755000000000000e-01 -6.755000000000000e-01
-6.755000000000000e-01 6.755000000000000e-01 -6.755000000000000e-01</step>
<step number="2">6.767628142309400e-15 -6.950100046595310e-09 1.390021315920880e-08 6.239857395114590e-01
-6.239857464615680e-01 6.239857534116811e-01 6.239857568867110e-01 6.239857499366001e-01
6.239857707869190e-01 6.239857742619920e-01 -6.239857812120860e-01 -6.239857603617700e-01
-6.239857916372510e-01 6.239857846871540e-01 -6.239857777370440e-01</step>
<step number="3">6.549446678833860e-15 1.124467050187860e-09 -2.248938851918010e-09 6.252750669032320e-01
-6.252750631744280e-01 6.252750594456050e-01 6.252750588833910e-01 6.252750626121890e-01
6.252750514257610e-01 6.252750508635410e-01 -6.252750471347340e-01 -6.252750583211300e-01
-6.252750428437061e-01 6.252750465725070e-01 -6.252750503012980e-01</step>
</tsvectable>
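The three property shapes above (`value`, `vector`, `tsvectable`) can be read with a standard XML parser. The sketch below assumes each property is its own small document and that a `tsvectable` step stores its numbers row by row, which matches the sample (the first atom sits at the origin); the function name `parse_property` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_property(xml_text):
    """Parse one <value>, <vector>, or <tsvectable> property document."""
    root = ET.fromstring(xml_text)
    prop = {"kind": root.tag, "name": root.get("name"), "units": root.get("units")}
    if root.tag == "value":
        prop["data"] = float(root.text)
    elif root.tag == "vector":
        prop["data"] = [float(token) for token in root.text.split()]
    elif root.tag == "tsvectable":
        columns = int(root.get("columns"))
        prop["data"] = {}
        for step in root.findall("step"):
            flat = [float(token) for token in step.text.split()]
            # Assumed row-major layout: consecutive groups of `columns` values.
            rows = [flat[i:i + columns] for i in range(0, len(flat), columns)]
            prop["data"][int(step.get("number"))] = rows
    else:
        raise ValueError(f"unsupported property element: {root.tag}")
    return prop


cpu = parse_property(
    '<?xml version="1.0" encoding="utf-8" ?>'
    '<value name="CPUSEC" units="second">9.60000000000000e-01</value>'
)
# Shortened two-atom trace in the same shape as GEOMTRACE.
trace = parse_property(
    '<tsvectable name="GEOMTRACE" rows="2" units="Angstrom" columns="3">'
    '<step number="1">0.0 0.0 0.0 -0.6755 -0.6755 0.6755</step>'
    '</tsvectable>'
)
print(cpu["data"], trace["data"][1][1])
```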
Crossing the Molecular to Thermodynamic Scales Data Model
Pedigree is imperative to moving data across scales.
[Diagram: pedigree graph for Vinoxy. Node labels: Input Parameters; NWChem Input File; NWChem Output File; NWChem Input; NWChem Output; Gaussian Input; Gaussian Output; Optimization and Frequencies; B3LYP; 6-31G*; MP2; MP2(FC); G3MP2large; QCISD; QCISD(T,FC); Energy; G3(MP2)B3LYP; Hf Vinoxy; NASA File; Active Tables; Vibrational Mode Animated GIF; Properties. Legend: Ecce; CMCS; Pedigree - hasInput; Pedigree - hasOutput.]
Ecce publishing
The Multi-scale Challenge
for Chemical Science
Impact of chemical science relies upon flow of information across physical scales
  Data from smaller scales supports models at larger scales
Critical science lies at scale interfaces
  Molecular properties, transport
  Mechanism validation, reduction
  Chemistry – fluid interactions
The pedigree of information matters
  The propagation of data pedigree across scales is difficult
  Validation and data reliability is often a post-publication process
Multi-scale science faces barriers
  Normal publication route is slow
  Numerous sub-disciplines employ different applications, formats, models
  Centers of excellence are geographically distributed
Multi-scale Chemical Science Data
Unique terascale reacting flow simulation databases – collection of files @ N x t, and experimental data
Chemical Mechanisms – k, MB files in various formats containing collections of reaction rates and transport coefficients. Modeled using theory, validated against experiments
Kinetic rates – by measurement and computation. Tables collected, reviewed and annotated. NIST WebBook, publications
Thermo-Chemistry – Tables of ‘constant’ properties of all molecules (of interest w/data) derived from many experiments, computations, extrapolations
Quantum chemistry computations of molecular properties – data from one number to large potential energy surfaces - input to thermochemistry and reaction rate computations
CMCS Spans Scales &
Geography
Biggest barrier is
“language” and informatics
Adaptive Informatics Infrastructure
Infrastructure – a well designed, scalable, reusable, flexible set of tools, middleware, and services
Informatics – the emerging use of semi-automated means to derive new knowledge from the analysis of (large amounts of) heterogeneous data, annotating existing data with its newly discovered meaning
Adaptive – able to dynamically change to incorporate new knowledge and support new activities
Low Barriers
Powerful
Many access points
Storage of data in original formats with dynamic metadata extraction and translation
Arbitrary formats (binary, ASCII, XML)
Integrated data, metadata, pedigree across internal and external tools
Evolvable
  Schema can be changed/extended as needed
  Metadata, translations, viewers, portal, etc. can be dynamically configured
CMCS Technical Choices Enable Adaptive, Long-lived Infrastructure
CMCS Data/Metadata services
  SAM Translation, Annotation
  WebDAV implementation
  Notification (JMS, NED)
  Search
  Pedigree browsing
  Core XML schema
  Security (JAAS)
Jetspeed (CHEF)
CMCS Explorer
Application portlets
Community services
Application Integration
  Webservices
  WebDAV API
  Multi-scale data including NIST access
[Diagram: A diagram representing the major conceptual elements of the CMCS Informatics Infrastructure. Labels: Quantum Chemistry; Kineticist; Thermochemist; Chemical Mechanisms; ThermoChemistry; Reacting Flow; Kinetics; Knowledge Management Tools; Community Tools; Research Support Tools; Multi-scale Chemical Science Portal; Chemistry Applications; Chemical Science Portal; Shared Data Service; Scientific Annotation Middleware (Parsers, Translators, Annotators, WebDAV); Data Sets (XML, Text, Binary) with Annotations; Local Services/Grid Fabric (Storage, Security, Event Services, Directory Services).]
How Metadata is Populated in CMCS
SAM Metadata Services Layer
  When data is put into WebDAV, SAM causes XSLTs to be executed to extract metadata from XML files, based on MIME type.
  Similarly, Binary File Descriptor (BFD) provides an interface to extract metadata from binary files.
  Other translators can be used as well.
CMCS data management/pedigree API to facilitate insertion and modification of metadata, in the proper XML format.
  Java code which allows software developers and scientists to easily write programs to add/edit metadata.
  Scientists can use these APIs to integrate with existing or new chemical science applications.
  Uses open source DAV and XML libraries.
Any WebDAV client application
  DAVExplorer: Java application
  CMCSExplorer: Integrated in the CMCS portal
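The first mechanism above, extraction triggered by MIME type when data is stored, can be sketched as a registry of extractors. This is not SAM's implementation: plain Python functions stand in for the registered XSLTs, a dictionary stands in for the WebDAV store, and the MIME type string and all function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

EXTRACTORS = {}  # MIME type -> metadata extractor


def register(mime_type):
    """Decorator that registers an extractor for one MIME type."""
    def decorator(func):
        EXTRACTORS[mime_type] = func
        return func
    return decorator


@register("application/x-hypothetical-property+xml")
def extract_property_metadata(content: bytes) -> dict:
    # Stand-in for an XSLT: pull two attributes out of the root element.
    root = ET.fromstring(content)
    return {"name": root.get("name"), "units": root.get("units")}


def put(store, path, content, mime_type):
    """Store the data unchanged and attach whatever metadata can be extracted."""
    extractor = EXTRACTORS.get(mime_type)
    metadata = extractor(content) if extractor else {}
    store[path] = {"content": content, "mime_type": mime_type, "metadata": metadata}
    return metadata


store = {}
extracted = put(
    store,
    "/demo/cpusec.xml",
    b'<value name="CPUSEC" units="second">9.6e-01</value>',
    "application/x-hypothetical-property+xml",
)
print(extracted)
```

Data with no registered extractor is still stored in its original form, only without extracted metadata.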
CMCS Metadata, Annotations, and Pedigree
Using Dublin Core for some basic pedigree properties of electronic publication: creator, dates, publisher, is-referenced-by, references, etc.
  Digital library standard for metadata
  http://www.dublincore.org
CMCS properties for Chemical Science to enable searching: species name, CAS, chemical properties, and chemical formula.
CMCS properties for defining scientific data: inputs, outputs, and is-part-of-project.
CMCS properties for scientific publication and peer review annotations: is-sanctioned-by.
More than 35 elements are currently defined in the core CMCS pedigree.
Flexible infrastructure for addition of new metadata. As new metadata is added to infrastructure, current apps will not break!
CMCS metadata is strongly encouraged, though not required, for all CMCS data, and CMCS metadata is highly extensible.
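One way to read "current apps will not break" is that a client uses only the properties it understands and ignores the rest. The sketch below illustrates that reading; the property key spellings and the function name `read_known` are assumptions, not the actual CMCS schema.

```python
# Property names this hypothetical application understands.
KNOWN_PROPERTIES = {"creator", "publisher", "references", "is-part-of-project"}


def read_known(properties, known=KNOWN_PROPERTIES):
    """Return the properties the client understands and the names it skipped."""
    understood = {name: value for name, value in properties.items() if name in known}
    ignored = sorted(set(properties) - set(known))
    return understood, ignored


# A resource annotated later with a property the client has never seen.
resource_properties = {
    "creator": "T. Windus",
    "is-part-of-project": "demo",
    "is-sanctioned-by": "review board",
}
understood, ignored = read_known(resource_properties)
print(understood, ignored)
```

The unknown property is skipped rather than treated as an error, so adding metadata to a resource does not require updating every client.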
Pedigree Browser Shows Input and Output
Relationships
Pedigree Browsing
Data is linked to projects, references, inputs, and outputs
The Browser enables metadata editing.
Automatic Translation and
Metadata Extraction
Data translations provided automatically by SAM using previously registered XSLTs for this file type.
Adaptive Infrastructure Enables
Application Integration
[Diagram labels: REACTIONLAB; Browser, e-mail; ELN 5.0; Ecce; MCS Portal; NWChem/GRID RESOURCES; Portlet; API; Fitdat; Notification; Web service; Shared Data Repository; Active Table; SAM (Mime-type Assignment, Metadata Extraction, Translation, Pedigree Relationships); Grid Fabric; Federation ML; NIST Kinetics DB]
Initial “Automatic Reasoning” Capability
Summary
Users just want to have ease of use and flexibility in viewing output – adaptive informatics infrastructure
“Standards” are useful, but it is necessary to be able to translate between diverse “schema” and “ontologies”
Metadata converts scientific data into knowledge
Multi-disciplinary Ecce Development Team
Gary Black -- Project lead
Karen Schuchardt -- Software architect lead
Bruce Palmer -- Chemist architect
Todd Elsethagen -- Data management lead
Erich Vorpagel -- Chemist consultant
Michael Peterson -- Operations support
Mahin Hackler -- Operations support
Sue Havre -- Application development
Brett Didier -- Application development
Carina Lansing -- Application development
Steve Matsumoto -- Online help lead
Colleen Winters -- Online help
Doug Rice -- Online help
Multi-disciplinary CMCS Team
Chemical Science
Computer/Information Science
Christine Yang, SNL
Larry Rahn*, SNL
Carmen Pancerella, SNL
Renata McCoy, SNL
Michael Lee, SNL
Wendy Koegler, SNL
Ed Walsh, SNL
John Hewson, SNL
David Montoya*, LANL
William H. Green, Jr. *, MIT
Lili Xu, LANL
Michael Frenklach*, UCB
Yen-Ling Ho, LANL
William Pitz*, LLNL
Michael Minkoff, ANL
Thomas C. Allison*, NIST
Sandra Bittner, ANL
Gregor von Laszewski, ANL
David Leahy, SNL
Sandeep Nijsure, ANL
Al Wagner*, ANL
Kaizar Amin, ANL
James D. Myers, PNL
Branko Ruscic, ANL
Brett Didier, PNL
Reinhardt Pinzon, ANL
Karen Schuchardt, PNL
Baoshan Wang, ANL
Eric Stephan, PNL
Carina Lansing, PNL
Theresa Windus*, PNL
Elena Mendoza, PNL
SAM
National Collaboratory Program
Acknowledgements
This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL). The MSCF is funded by the Office of Biological and Environmental Research in the U. S. Department of Energy (DOE). PNNL is operated by Battelle for the U. S. Department of Energy under contract DE-AC06-76RLO 1830. Funding is also provided by the Mathematics, Information and Computer Science and Basic Energy Sciences Division of DOE.
End