Data Management and Representations in Ecce and CMCS
Theresa L. Windus
Pacific Northwest National Laboratory
Environmental Molecular Sciences Laboratory
Molecular Science Software Group

Outline
  Some “definitions”
  Data and task representations
  Ecce
  CMCS
  Summary
  Acknowledgement

Data and metadata (one scientist’s data is another scientist’s metadata)
  ΔH°atomiz,0(CH3OOH) = 522.09 ± 2.02 kcal/mol [calculated, G3//B3LYP, T. Windus, more at http://...]
  data: value and uncertainty
  units: kcal/mol
  quantity: enthalpy of atomization
  species: methylhydroperoxide, CAS# 3031-73-0
  temperature: 0 K
  calculated: G3//B3LYP
  creator: T. Windus using Ecce
  more info: http://avatar.emsl.pnl.gov:8080/Ecce/.../CH3OOH/.../GxEnergy

Metadata Converts Scientific Data into Knowledge
  Metadata provides identification and documentation to scientific data.
    Example: Attaching an owner, creation date, abstract, type to data.
    Example: Tracking data to program versions, and possibly bugs for that version.
  Metadata documents the context and value of the data.
    Example: The theoretical atomization energy of methylhydroperoxide (and its uncertainty) from Ecce (used as input to ATcT) contains information identifying the species and the quantity, units, the theoretical method used, vibrational frequencies and geometry, reference to source file, creator, etc.
  Metadata facilitates cross-scale transfer of data.
    Example: Can show a chain of inputs, including input parameters and configuration files, across scales.
    Example: Can retrieve literature references which describe this data.
  Metadata allows users to comment on the data and its quality.
    Example: Can be used for scientific peer review of data.
  Metadata is necessary for effective collaboration.
    Example: Scientific data becomes more usable to others when it is documented.
  Annotation is another term for metadata.
    Annotations can be added by either the data owner or a third party.
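
The CH3OOH example earlier in this talk can be sketched as a small Python record that keeps the value, its uncertainty, and its metadata together. This is only an illustration: the class name `AnnotatedValue` and the metadata keys are assumptions, not Ecce or CMCS data structures; the numbers and labels are the ones given on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedValue:
    """A numeric result together with the metadata that gives it meaning."""

    value: float
    uncertainty: float
    units: str
    metadata: dict = field(default_factory=dict)

    def describe(self) -> str:
        # Render value, uncertainty, and units, followed by the context.
        context = ", ".join(f"{k}: {v}" for k, v in self.metadata.items())
        return f"{self.value:.2f} ± {self.uncertainty:.2f} {self.units} [{context}]"


# Values from the CH3OOH slide; the key names are hypothetical.
h_atomization = AnnotatedValue(
    value=522.09,
    uncertainty=2.02,
    units="kcal/mol",
    metadata={
        "quantity": "enthalpy of atomization",
        "species": "methylhydroperoxide",
        "cas": "3031-73-0",
        "temperature": "0 K",
        "method": "G3//B3LYP",
        "creator": "T. Windus",
    },
)

print(h_atomization.describe())
```

Without the `metadata` dictionary the record is just the pair (522.09, 2.02); the added fields are what let another scientist decide whether the number can be reused.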

Data Pedigree: A Special Kind of Metadata
  Data pedigree or data provenance is a relationship which provides a “line of ancestors”.
  Pedigree allows for the categorization and tracing of the scientific data, and for the identification of the data’s ultimate origin, possibly across scales.
  Pedigree includes the series of steps necessary to reproduce the data.
  Data is linked, for example, to projects, references, inputs, and outputs.

Knowledge Grid
  A set of scalable tools, middleware, and services
  For the creation, analysis, dissemination, evaluation, and use
  Of data, information, and knowledge
  By individuals, groups, and communities
  …A digital place for performing ‘all’ aspects of science

Ecce & NWChem
  Ecce – Extensible Computational Chemistry Environment
    comprehensive problem solving environment
    common graphical user interfaces
    scientific modeling management
    seamless transfer of information between applications
    persistent data storage through DAV
    integrated scientific data management
    tools for ensuring efficient use of computing resources across a distributed network
    visualization of multi-dimensional data structures
    http://ecce.emsl.pnl.gov
  NWChem – massively parallel computational chemistry program
    Energetics, geometries, frequencies, etc. at various levels of theory
    http://www.emsl.pnl.gov/docs/nwchem

Ecce is… (cont.)

Ecce Architecture

Distributed Authoring and Versioning (DAV)
  An early web service (XML commands over HTTP)
  A widely adopted standard for metadata/data transport
  Put/Get data with arbitrary properties (dynamic)
  Properties can be discovered and accessed independently
  DASL, Versioning, Transactions, …

What does the WebDAV protocol provide?
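
As a rough sketch of what "properties can be discovered and accessed independently" means in practice, the Python below builds the XML body of a WebDAV PROPFIND request for two named properties and parses a Multi-Status reply. The sample reply, the resource path `/Ecce/demo/calc`, and the treatment of `http://www.emsl.pnl.gov/ecce:` as the property namespace are assumptions made for illustration; no server is contacted.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
# Assumed namespace for Ecce properties (hypothetical; inferred from property names).
ECCE_NS = "http://www.emsl.pnl.gov/ecce:"


def build_propfind(props):
    """Build a PROPFIND body asking for specific (namespace, name) properties."""
    root = ET.Element(f"{{{DAV_NS}}}propfind")
    prop = ET.SubElement(root, f"{{{DAV_NS}}}prop")
    for namespace, name in props:
        ET.SubElement(prop, f"{{{namespace}}}{name}")
    return ET.tostring(root, encoding="unicode")


def parse_multistatus(xml_text):
    """Return {href: {"{namespace}name": text}} from a Multi-Status body."""
    result = {}
    root = ET.fromstring(xml_text)
    for response in root.iter(f"{{{DAV_NS}}}response"):
        href = response.findtext(f"{{{DAV_NS}}}href")
        values = {}
        for prop in response.iter(f"{{{DAV_NS}}}prop"):
            for child in prop:
                values[child.tag] = child.text
        result[href] = values
    return result


# Hypothetical server reply, written by hand for this sketch.
SAMPLE_RESPONSE = """<D:multistatus xmlns:D="DAV:" xmlns:e="http://www.emsl.pnl.gov/ecce:">
  <D:response>
    <D:href>/Ecce/demo/calc</D:href>
    <D:propstat>
      <D:prop>
        <e:theory>SCF/RHF</e:theory>
        <e:state>Complete</e:state>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

print(build_propfind([(ECCE_NS, "theory"), (DAV_NS, "getlastmodified")]))
print(parse_multistatus(SAMPLE_RESPONSE))
```

Sending the body would require an HTTP PROPFIND request with a `Depth` header; that step is left out so the sketch runs offline.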
  [Diagram labels: Applications, HTTP, WebDAV, DAV Server, Data Storage Provider, Collection, Resource, Properties]

Accessing WebDAV Server from Windows 2000

Accessing WebDAV Server Using Browser

Accessing WebDAV Server Using Ecce
  [Screenshot labels: Calculation, BasisSet, Files, Chemical System, Properties]

Ecce Physical Model
  Calculations are referred to as a “virtual document” because we distribute the structure across many physical objects.
  Physical collections and resources are URI addressable.
  Collections are unordered and allow mixed content.
  [Diagram: Project contains Project; Project contains Calculation; Calculation is composed of Basis Set, Chemical System, Properties, Setup, Data/Logs. Physical layout labels: Project, Calculation, BasisSet, Files, Chemical System, Properties]

Calculation Setup
  [Diagram labels: Basis Set Tool, Builder, Calculation Editor, Template File, Parameters, .edml File, Geometry, ESP, Basis Set, Theory Details, Runtype Details, Python, Basis Set Reformatting Script, Perl, ai.input, Input Deck]

Output Parsing
  [Diagram labels: Perl, Output, Job Monitor, Parse Descriptor, Text Block 1 / Parse Script 1, Text Block 2 / Parse Script 2, …, Text Block N / Parse Script N, Ecce Database, Calculation Viewer]

Example metadata
  On the molecule:
    http://www.emsl.pnl.gov/ecce:empiricalFormula=H4C
    http://www.emsl.pnl.gov/ecce:charge=0.000000
    http://www.emsl.pnl.gov/ecce:useSymmetry=false
    http://www.emsl.pnl.gov/ecce:symmetrygroup=C1
    DAV:creationdate=2004-03-22T17:24:38Z
    DAV:getcontentlength=386
    DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
    DAV:getetag="b28064-182-926a8180"
    DAV:executable=F
    DAV:supportedlock=
    DAV:getcontenttype=chemical/x-ecce-mvm
  On the calculation:
    http://www.emsl.pnl.gov/ecce:contenttype=ecceCalculation
    http://www.emsl.pnl.gov/ecce:resourcetype=VIRTUAL_DOCUMENT
    http://www.emsl.pnl.gov/ecce:createdWith=v3.2
    http://www.emsl.pnl.gov/ecce:owner=d39974
    http://www.emsl.pnl.gov/ecce:application=NWChem
    http://www.emsl.pnl.gov/ecce:theory=SCF/RHF
    http://www.emsl.pnl.gov/ecce:spinmultiplicity=Singlet
    http://www.emsl.pnl.gov/ecce:currentVersion=v3.2
    http://www.emsl.pnl.gov/ecce:creationdate=Mon, 22 Mar 2004 17:24:00 GMT
    http://www.emsl.pnl.gov/ecce:reviewed=false
    http://www.emsl.pnl.gov/ecce:runtype=ESP
    http://www.emsl.pnl.gov/ecce:launch_machine=arunta
    http://www.emsl.pnl.gov/ecce:launch_nodes=1
    http://www.emsl.pnl.gov/ecce:launch_rundir=/home/d39974/ecceruns
    http://www.emsl.pnl.gov/ecce:launch_totalprocs=1
    http://www.emsl.pnl.gov/ecce:launch_user=d39974
    http://www.emsl.pnl.gov/ecce:launch_maxmemory=0
    http://www.emsl.pnl.gov/ecce:launch_remoteShell=ssh
    http://www.emsl.pnl.gov/ecce:job_jobid=13858
    http://www.emsl.pnl.gov/ecce:job_path=/home/d39974/ecceruns/tracebug/esp
    http://www.emsl.pnl.gov/ecce:job_clienthost=arunta
    http://www.emsl.pnl.gov/ecce:startdate=Mon, 22 Mar 2004 17:25:11 GMT
    http://www.emsl.pnl.gov/ecce:version=Thu May 8 13:16:51 PDT 2003 Version 4.5
    http://www.emsl.pnl.gov/ecce:state=Complete
    http://www.emsl.pnl.gov/ecce:completiondate=Mon, 22 Mar 2004 17:25:14 GMT
    DAV:resourcetype=<D:collection/>
    DAV:creationdate=2004-03-22T17:24:38Z
    DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
    DAV:getetag="b2805d-1000-926a8180"
    DAV:supportedlock=
    DAV:getcontenttype=httpd/unix-directory

Example MVM file
  title: demo
  type: molecule
  num_atoms: 1065
  atom_info: symbol cart
  atom_list:
  O -2.37400 -3.09100 13.5210
  H -1.91600 -2.20200 14.0480
  ...
  pdb_list:
  H O5* RC 1 157D A
  H H5T RC 1 157D A
  …
  attr_list:
  -0.622300 1 1 0 0
  0.429500 1 1 0 0
  …
  atom_type_list:
  OH
  HO
  …
  num_bonds: 1028
  bond_list:
  2 1 1.00000
  1 3 1.00000
  …

XML format for Properties
  <?xml version="1.0" encoding="utf-8" ?>
  <value name="CPUSEC" units="second">9.60000000000000e-01</value>

  <?xml version="1.0" encoding="utf-8" ?>
  <vector name="MLKNSHELL" rows="7" units="e" rowLabel="Unknown" rowLabels="1 2 3 4 5 6 7">
    1.99199825923126e+00 1.18803456337004e+00 3.08260463820159e+00
    9.34340637068915e-01 9.34340635555820e-01 9.34340634042729e-01
    9.34340632529639e-01
  </vector>

  <?xml version="1.0" encoding="utf-8" ?>
  <tsvectable name="GEOMTRACE" rows="5" units="Angstrom" columns="3" vectors="1" rowLabel="Atom,Coordinate" rowLabels="0 1 2 3 4" columnLabel="Coordinate" vectorLabel="Coordinate" columnLabels="X Y Z">
    <step number="1">
      0.000000000000000e+00 0.000000000000000e+00 0.000000000000000e+00
      -6.755000000000000e-01 -6.755000000000000e-01 6.755000000000000e-01
      6.755000000000000e-01 6.755000000000000e-01 6.755000000000000e-01
      6.755000000000000e-01 -6.755000000000000e-01 -6.755000000000000e-01
      -6.755000000000000e-01 6.755000000000000e-01 -6.755000000000000e-01
    </step>
    <step number="2">
      6.767628142309400e-15 -6.950100046595310e-09 1.390021315920880e-08
      6.239857395114590e-01 -6.239857464615680e-01 6.239857534116811e-01
      6.239857568867110e-01 6.239857499366001e-01 6.239857707869190e-01
      6.239857742619920e-01 -6.239857812120860e-01 -6.239857603617700e-01
      -6.239857916372510e-01 6.239857846871540e-01 -6.239857777370440e-01
    </step>
    <step number="3">
      6.549446678833860e-15 1.124467050187860e-09 -2.248938851918010e-09
      6.252750669032320e-01 -6.252750631744280e-01 6.252750594456050e-01
      6.252750588833910e-01 6.252750626121890e-01 6.252750514257610e-01
      6.252750508635410e-01 -6.252750471347340e-01 -6.252750583211300e-01
      -6.252750428437061e-01 6.252750465725070e-01 -6.252750503012980e-01
    </step>
  </tsvectable>

Crossing the Molecular to Thermodynamic Scales
  Pedigree is imperative to moving data across scales.
  [Diagram labels: Data Model, Input Parameters, NWChem Input File, NWChem Output File, Gaussian Input, Gaussian Output, Properties, Optimization and Frequencies, Energy, B3LYP, 6-31G*, MP2, MP2(FC), G3MP2large, QCISD, QCISD(T,FC), G3(MP2)B3LYP, Hf, Vinoxy, Vinoxy Vibrational Mode Animated GIF, Vinoxy NASA File, Vinoxy Active Tables. Legend: Ecce, CMCS, Pedigree - hasInput, Pedigree - hasOutput]

Ecce publishing

The Multi-scale Challenge for Chemical Science
  Impact of chemical science relies upon flow of information across physical scales
    Data from smaller scales supports models at larger scales
    Critical science lies at scale interfaces
      Molecular properties, transport
      Mechanism validation, reduction
      Chemistry – fluid interactions
  The pedigree of information matters
    The propagation of data pedigree across scales is difficult
    Validation and data reliability is often a post-publication process
  Multi-scale science faces barriers
    Normal publication route is slow
    Numerous sub-disciplines employ different applications, formats, models
    Centers of excellence are geographically distributed

Multi-scale Chemical Science Data
  Unique terascale reacting flow simulation databases – collection of files @ N x t, and experimental data
  Chemical Mechanisms – k, MB files in various formats containing collections of reaction rates and transport coefficients. Modeled using theory, validated against experiments
  Kinetic rates – by measurement and computation. Tables collected, reviewed and annotated.
    NIST WebBook, publications
  Thermo-Chemistry – Tables of ‘constant’ properties of all molecules (of interest w/data) derived from many experiments, computations, extrapolations
  Quantum chemistry computations of molecular properties – data from one number to large potential energy surfaces – input to thermochemistry and reaction rate computations

CMCS Spans Scales & Geography
  Biggest barrier is “language” and informatics

Adaptive Informatics Infrastructure
  Infrastructure – a well designed, scalable, reusable, flexible set of tools, middleware, and services
  Informatics – the emerging use of semi-automated means to derive new knowledge from the analysis of (large amounts of) heterogeneous data, annotating existing data with its newly discovered meaning
  Adaptive – able to dynamically change to incorporate new knowledge and support new activities
  Low Barriers, Powerful
    Many access points
    Storage of data in original formats with dynamic metadata extraction and translation
    Arbitrary formats (binary, ASCII, XML)
    Integrated data, metadata, pedigree across internal and external tools
  Evolvable
    Schema can be changed/extended as needed
    Metadata, translations, viewers, portal, etc. can be dynamically configured

CMCS Technical Choices Enable Adaptive, Long-lived Infrastructure
  CMCS Data/Metadata services
    SAM Translation, Annotation
    WebDAV implementation
    Notification (JMS, NED)
    Search
    Pedigree browsing
    Core XML schema
    Security (JAAS)
  Jetspeed (CHEF)
    CMCS Explorer
    Application portlets
    Community services
  Application Integration
    Webservices
    WebDAV API
  Multi-scale data including NIST access
  [Diagram labels: Multi-scale Chemical Science Portal, Chemical Science Portal, Quantum Chemistry, Thermochemist, Kineticist, Chemical Mechanisms, ThermoChemistry, Reacting Flow, Kinetics, Knowledge Management Tools, Community Tools, Research Support Tools, Chemistry Applications, Shared Data Service, Scientific Annotation Middleware, Parsers, Translators, Annotators, WebDAV, Annotation, XML, Text, Binary, Data Set, Local Services/Grid Fabric, Storage, Security, Event Services, Directory Services]
  A diagram representing the major conceptual elements of the CMCS Informatics Infrastructure.

How Metadata is Populated in CMCS
  SAM Metadata Services Layer
    When data is put into WebDAV, SAM causes XSLTs to be executed to extract metadata from XML files, based on MIME type.
    Similarly, Binary File Descriptor (BFD) provides an interface to extract metadata from binary files.
    Other translators can be used as well.
  CMCS data management/pedigree API to facilitate insertion and modification of metadata, in the proper XML format.
    Java code which allows software developers and scientists to easily write programs to add/edit metadata.
    Scientists can use these APIs to integrate with existing or new chemical science applications.
    Uses open source DAV and XML libraries.
  Any WebDAV client application
    DAVExplorer: Java application
    CMCSExplorer: Integrated in the CMCS portal

CMCS Metadata, Annotations, and Pedigree
  Using Dublin Core for some basic pedigree properties of electronic publication: creator, dates, publisher, is-referenced-by, references, etc.
    Digital library standard for metadata
    http://www.dublincore.org
  CMCS properties for Chemical Science to enable searching: species name, CAS, chemical properties, and chemical formula.
  CMCS properties for defining scientific data: inputs, outputs, and is-part-of-project.
  CMCS properties for scientific publication and peer review annotations: is-sanctioned-by.
  More than 35 elements are currently defined in the core CMCS pedigree.
  Flexible infrastructure for addition of new metadata.
    As new metadata is added to the infrastructure, current apps will not break!
  CMCS metadata is strongly encouraged, though not required, for all CMCS data, and CMCS metadata is highly extensible.

Pedigree Browser Shows Input and Output Relationships

Pedigree Browsing
  Data is linked to projects, references, inputs, and outputs
  The Browser enables metadata editing.

Automatic Translation and Metadata Extraction
  Data translations provided automatically by SAM using previously registered XSLTs for this file type.

Adaptive Infrastructure Enables Application Integration
  [Diagram labels: REACTIONLAB, Browser, e-mail, ELN 5.0, Ecce, MCS Portal, NWChem/GRID RESOURCES, Portlet, API, Fitdat, Notification, Web service, Shared Data Repository, Active Table, SAM, Mime-type Assignment, Metadata Extraction, Translation, Pedigree Relationships, Grid Fabric, Federation, ML, NIST Kinetics DB]

Initial “Automatic Reasoning” Capability

Summary
  Users just want to have ease of use and flexibility in viewing output – adaptive informatics infrastructure
  “Standards” are useful, but it is necessary to be able to translate between diverse “schema” and “ontologies”
  Metadata converts scientific data into knowledge

Multi-disciplinary Ecce Development Team
  Gary Black -- Project lead
  Karen Schuchardt -- Software architect lead
  Bruce Palmer -- Chemist architect
  Todd Elsethagen -- Data management lead
  Erich Vorpagel -- Chemist consultant
  Michael Peterson -- Operations support
  Mahin Hackler -- Operations support
  Sue Havre -- Application development
  Brett Didier -- Application development
  Carina Lansing -- Application development
  Steve Matsumoto -- Online help lead
  Colleen Winters -- Online help
  Doug Rice -- Online help

Multi-disciplinary CMCS Team
  Chemical Science; Computer/Information Science
  Christine Yang, SNL
  Larry Rahn*, SNL
  Carmen Pancerella, SNL
  Renata McCoy, SNL
  Michael Lee, SNL
  Wendy Koegler, SNL
  Ed Walsh, SNL
  John Hewson, SNL
  David Montoya*, LANL
  William H. Green, Jr.*, MIT
  Lili Xu, LANL
  Michael Frenklach*, UCB
  Yen-Ling Ho, LANL
  William Pitz*, LLNL
  Michael Minkoff, ANL
  Thomas C. Allison*, NIST
  Sandra Bittner, ANL
  Gregor von Laszewski, ANL
  David Leahy, SNL
  Sandeep Nijsure, ANL
  Al Wagner*, ANL
  Kaizar Amin, ANL
  James D. Myers, PNL
  Branko Ruscic, ANL
  Brett Didier, PNL
  Reinhardt Pinzon, ANL
  Karen Schuchardt, PNL
  Baoshan Wang, ANL
  Eric Stephan, PNL
  Carina Lansing, PNL
  Theresa Windus*, PNL
  Elena Mendoza, PNL
  SAM
  National Collaboratory Program

Acknowledgements
  This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL). The MSCF is funded by the Office of Biological and Environmental Research in the U.S. Department of Energy (DOE). PNNL is operated by Battelle for the U.S. Department of Energy under contract DE-AC06-76RLO 1830. Funding is also provided by the Mathematics, Information and Computer Science and Basic Energy Sciences Division of DOE.

End