Toward a Common Data and Command Representation for Quantum Chemistry
Philip Couch
e-Science

Outline of e-CCP1 Project
• Investigate the technological requirements for enabling effective use of Grid resources by the quantum chemistry community
  – Middleware (Globus, Unicore, EGEE)
  – Compute resources
  – Client tools (CoG kits)
  – Track and develop, where necessary, the emerging standards in computational chemistry data and command representation (XML-based CML, CMLComp, FSAtom).
• Realise these requirements by developing some core tools that can be deployed and customised by CCP1 code developers.
• Develop GUI interfaces that will operate with a range of CCP1 codes and implement Grid functionality.

Presenter Name Philip Couch Facility Name e-Science

Motivation
• Motivation:
  – The emergence of Grid technologies has provided a generalised framework for the interoperability of computational codes.
  – A common data and command representation:
    • Promotes appropriate data re-use
    • Makes data available to a wider community
  – There are many existing ways to represent data – why not just convert between them (e.g. Open Babel)?
    • Error prone
    • If there are n formats, n(n-1) converters are required. A solution is to find a common ‘middle ground’ for data (2n converters)

Data Types
• What data could we represent?
  – Data/parameters
    • Structures
    • Scalar properties
    • Molecular orbitals
    • Normal modes of vibration
    • Dynamics
    • Basis sets
    • Force fields
    • Pseudo-potentials
  – Control
    • Energy convergence criteria
    • SCF steps
    • Mixing parameters
    • Mesh properties…
• Some data can be shared amongst codes, others will be code specific – semantics is important
• Some of the data will be meta-data (e.g. code used, version, method…)
• Some of the data will define relationships between other data.

Data Representation
• What are the existing ways of representing data?
  – Formats like CIF
  – Relational databases
  – XML (e.g. CML)
  – Objects, methods, data members (intermediate step)
• But, how do we implement the data models (how do we define our vocabulary)?
  – SQL
  – XML schema
  – Class interfaces (e.g. W3C IDL based DOM recommendations)
  – UML

Semantics and Ontology
Semantics
• Providing the meaning of vocabulary is important.
  – We want to ensure appropriate re-use of data.
• Semantics can be controlled by:
  – Annotating the data model (e.g. in XML schema <xsd:annotation>)
  – Links to external sources (e.g. XML dictionaries in CML)
Ontology
• An ontology can be thought of as ‘an explicit specification of concepts and the relationships between them.’
• Relationships between concepts can be expressed using the Resource Description Framework (RDF). RDF is the basis of ontology languages such as OWL and DAML+OIL.
• RDF Schema specifies the relationships used by the RDF and the relationships between relationships…
• An ontology helps to reduce implicit assumptions about data and their relationships.

XML Representation
• XML is a widely adopted and mature method of representing structured information
• A vast and increasing range of tools makes XML easily readable and interpretable by applications authored by different groups
• At the expense of conciseness:
  – XML is self describing – it carries meta-data
  – XML can be explicit about data
• Some methods of representing data in XML already exist (e.g. CML), for which there are many tools

An Example
[Figure: A geometry representation for the CH molecule]
[Figure: A basis set representation for the CH molecule]

Relationships
• How do we link the basis sets and geometries?
  – Could rely on implicit linking (<atom elementType="C"…> with <basisSet id="C1"…>)
    • But what happens if we want to change the rules?
  – Could use attributes (<atom id="a1" basis="C1"…>)
    • But documents could come from different sources, and don’t know about each other’s attributes
    • Continual revision of the data model
  – Could describe the relationship using RDF, or in an RDF-like manner
• RDF/n3:
    @prefix r1: <file://chGeom.xml#xpointer> .
    @prefix r2: <file://chBasis.xml#xpointer> .
    @prefix r3: <file://eccpRelations.html#> .
    <r2:(//basisSet[@id="C1"])> <r3:isBasisFor> <r1:(//atom[@elementType="C"])> .
    <r2:(//basisSet[@id="H1"])> <r3:isBasisFor> <r1:(//atom[@elementType="H"])> .

Relationships
• But… passing text is not straightforward – can serialise RDF/n3 to RDF/XML
• RDF/XML (converted using CWM)
    <rdf:RDF xmlns:r1="file://chGeom.xml#xpointer"
             xmlns:r2="file://chBasis.xml#xpointer"
             xmlns:r3="file://chRelations.html#"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="r2:(//basisSet[@id=&#34;C1&#34;])">
        <isBasisFor xmlns="r3:" rdf:resource="r1:(//atom[@elementType=&#34;C&#34;])"/>
      </rdf:Description>
      <rdf:Description rdf:about="r2:(//basisSet[@id=&#34;H1&#34;])">
        <isBasisFor xmlns="r3:" rdf:resource="r1:(//atom[@elementType=&#34;H&#34;])"/>
      </rdf:Description>
    </rdf:RDF>

Other Design Considerations
• How implicit/explicit should we be?
• When should we use ‘general’ or ‘grouping’ tags for data?
  – E.g. to take a CML-like example:
      Exchange energy = -1.025771783
    or
      <eexchange>-1.025771783</eexchange>
    or
      <scalar dictRef="eccp:eexchange">-1.025771783</scalar>
• To what extent should we tag data?
  – E.g.
      <basisExponents>0.0 0.0 0.0 0.0</basisExponents>
    or
      <basisExponents>
        <n>0.0</n>
        <n>0.0</n>
        <n>0.0</n>
        <n>0.0</n>
      </basisExponents>

Using XML
[Diagram labels: doc1.xml, doc2.xml, doc3.xml; Parser; in.txt; Application; native/foreign XML libraries (SAX, DOM); out.txt; meta-data; Parser; out.xml]
Reading:
  – Convert to standard application input (e.g. use XSLT)
  – Read in the XML directly by using existing/writing your own native/foreign code
Writing:
  – Parse the standard application output and convert to XML
  – Write the XML directly by using existing/writing your own native/foreign code

Using XML - Comments
• Comments:
  – Careful choice of DOM or SAX parser implementation
    • DOM – potentially large overheads when used with large model instances
    • SAX – difficult to code when data is heavily cross referenced
  – Until recently, XML support for FORTRAN has been poor. No existing native parsers and it’s difficult to write your own
    • Solutions: native FORTRAN XML modules (Alberto Garcia), FORTRAN DOM (Jon Wakelin)
  – XML libraries such as libXML and Xerces could be used with appropriate wrappers
  – There are mixed SAX and DOM API implementations, e.g. libXML xmlTextReader
  – Parsing standard output is a good option for proprietary code, but suffers from versioning
  – Writing formatted data directly is error prone
    • FORTRAN WXML module (Alberto Garcia)
    • FORTRAN CML writer (Jon Wakelin)

Automation
• Data models evolve with time. It is hard work to maintain code by hand. Ideally…
[Diagram labels: schema; API generator; API; XML; wrapper objects; ?; application]
• CML – Java and C++ API generators
• CCPN – Python API generators
• Still have to worry about mapping the wrapper data structures to the internal data structures of the application.

Data Modelling
• The focus is back to the data model.
• XSD is not easy to interpret, impeding a collaborative approach to data model design
• Designing is complicated by implementation decisions – it is a good idea to separate the conceptualisation and implementation
• Represent the data model in the Unified Modelling Language (UML)?
  – This is a graphical notation (mainly) for expressing designs.
• Can UML express XSD implementation decisions?
  – Yes, through UML stereotypes (subtypes of meta-model types)
  – A UML profile (collection of stereotypes) for schema design has been developed by David Carlson

UML data model
[Figure: UML equivalent to the XSD geometry and basis set data model]
• UML can be represented as XMI to facilitate the communication of data models between applications. Hypermodel will convert XMI to XSD.

Binary Data
• Some scientific data would be best stored in binary (e.g. molecular orbitals)
• Binary data could simply be pointed to by XML
• But…
  – Sharing binary data requires a machine independent way of storing it.
• Could use:
  – HDF
  – NetCDF
  – BinX/DFDL

Current Status
• Drafting CML-like markup and schema for some computational chemistry data
  – Basis sets
  – Molecular orbitals
  – Cartesian and internal coordinates
  – Molecular vibrations
  – Job parameters
  – Scalar quantities
• Set up an eCCP1 Wiki for discussions (grids.ac.uk/eccp)
• Set up a NeSCForge project page for code/data model development

Current Status
• Developing a C API for parsing CML geometries
  – Linked to libXML2
  – Designed to scale well with XML file size (uses xmlTextReader)
  – Designed to be easily FORTRAN callable
  – Transparently reads gzipped XML files
• Python module written to read in CML1/2 molecular atom information for the CCP1 GUI
• GROWL (Grid Resources on Workstation Language), a C API for utilising current CLRC Grid portal services, is being developed.

Meeting Logistics
Agenda

Monday 5th April
Time            Format
10.00 – 11.10   Presentations
11.10 – 11.25   Refreshments
11.25 – 12.35   Presentations
12.35 – 13.35   Lunch
13.35 – 14.10   Presentations
14.10 – 15.40   Practical session and announcements
15.40 – 16.00   Refreshments
16.00 – 17.30   Practical session
19.00           Conference dinner

Tuesday 6th April
Time            Format
09.00 – 10.10   Presentations
10.10 – 10.25   Refreshments
10.25 – 12.10   Presentations
12.10 – 13.10   Lunch
13.10 – 14.25   Open discussions
14.25 – 14.40   Refreshments
14.40 – 16.00   Open discussions
16.00           Meeting close

Locations used: Lecture theatre; Lecture theatre, training lab, Cramond; Cramond (per-session assignment not shown)

Publication of meeting material
1. Contributed material to be published independently (e.g.
NeSC technical report on CML)
   Meeting proceedings (summary) + presentations on web
   Create a meeting CD containing presentations and proceedings
2. Contributed material to be published independently
   Meeting proceedings (technical report written by authors of contributed material), focus on existing material and decisions on way forward.
   Create a meeting CD, presentations, proceedings, and code?

Discussion Topics
• How to construct a working group
  – Who could be involved?
• Process of data model refinement
• Reference Implementation
  – Platforms?
  – What are the requirements?
  – Man power?
• How focused should we be - data types to include, platforms…etc
• How do we reach a consensus
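
As one possible starting point for the reference-implementation discussion, the sketch below shows how the externally held isBasisFor relations from the Relationships slides might be resolved against a geometry document and a basis set document. It is an illustration only: the two XML documents are hypothetical stand-ins (the deck's own CH examples were figures), the coordinates are placeholders, the function name link_basis_sets is invented here, and Python's standard xml.etree.ElementTree with its limited XPath subset is used in place of real XPointer/RDF tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical CML-like documents. The atom/basisSet element names and the
# elementType/id attributes follow the fragments quoted on the Relationships
# slides; everything else (root elements, coordinates) is assumed.
GEOM_XML = """
<molecule id="CH">
  <atom id="a1" elementType="C" x3="0.0" y3="0.0" z3="0.0"/>
  <atom id="a2" elementType="H" x3="0.0" y3="0.0" z3="1.0"/>
</molecule>
"""

BASIS_XML = """
<basisSets>
  <basisSet id="C1"/>
  <basisSet id="H1"/>
</basisSets>
"""

# Relations held outside both documents, as
# (subject path in basis document, predicate, object path in geometry document).
RELATIONS = [
    (".//basisSet[@id='C1']", "isBasisFor", ".//atom[@elementType='C']"),
    (".//basisSet[@id='H1']", "isBasisFor", ".//atom[@elementType='H']"),
]


def link_basis_sets(geom_xml, basis_xml, relations):
    """Return a mapping from atom id to basis set id for isBasisFor relations."""
    geom = ET.fromstring(geom_xml)
    basis = ET.fromstring(basis_xml)
    links = {}
    for subject_path, predicate, object_path in relations:
        if predicate != "isBasisFor":
            continue  # other relation types are ignored in this sketch
        for basis_set in basis.findall(subject_path):
            for atom in geom.findall(object_path):
                links[atom.get("id")] = basis_set.get("id")
    return links


links = link_basis_sets(GEOM_XML, BASIS_XML, RELATIONS)
print(links)
```

Because the relations live outside both documents, changing the linking rules (for example, assigning a basis set per atom id rather than per element) only means editing RELATIONS; neither document's data model needs revising.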