Toward a Common Data and Command Representation for Quantum Chemistry

Philip Couch
e-Science
Outline of e-CCP1 Project
• Investigate the technological requirements for enabling effective
use of Grid resources by the quantum chemistry community
– Middleware (Globus, Unicore, EGEE)
– Compute resources
– Client tools (CoG kits)
– Track and develop, where necessary, the emerging standards in
computational chemistry data and command representation (XML-based CML, CMLComp, FSAtom).
• Realise these requirements by developing some core tools that can
be deployed and customised by CCP1 code developers.
• Develop GUI interfaces that will operate with a range of CCP1 codes
and implement Grid functionality.
Motivation
• Motivation:
– The emergence of Grid technologies has provided a generalised
framework for the interoperability of computational codes.
– A common data and command representation:
• Promotes appropriate data re-use
• Makes data available to a wider community
– There are many existing ways to represent data – why not just
convert between them (e.g. Open Babel)?
• Error prone
• If there are n formats, n(n-1) one-way converters are required. A
solution is to find a common ‘middle ground’ for data, which needs only 2n converters
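The converter counts above can be checked with a short calculation. This is a minimal sketch; the function names are illustrative and do not come from any existing tool.

```python
def pairwise_converters(n: int) -> int:
    """One-way converters needed to translate directly between n formats."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """Converters needed when each format maps to and from one common format."""
    return 2 * n


for n in (3, 5, 10):
    print(n, pairwise_converters(n), hub_converters(n))
```

The two counts are equal at n = 3; the common representation needs fewer converters for every n above that.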
Data Types
• What data could we represent?
– Data/parameters
• Structures
• Scalar properties
• Molecular orbitals
• Normal modes of vibration
• Dynamics
• Basis sets
• Force fields
• Pseudo-potentials
– Control
• Energy convergence criteria
• SCF steps
• Mixing parameters
• Mesh properties…
• Some data can be shared amongst codes, others will be code specific – semantics is important
• Some of the data will be meta-data (e.g. code used, version, method…)
• Some of the data will define relationships between other data.
Data Representation
• What are the existing ways of representing data?
– Formats like CIF
– Relational databases
– XML (e.g. CML)
– Objects, methods, data members (intermediate step)
• But, how do we implement the data models (how do we define our
vocabulary)?
– SQL
– XML schema
– Class interfaces (e.g. W3C IDL based DOM recommendations)
– UML
Semantics and Ontology
Semantics
• Providing the meaning of vocabulary is important.
– We want to ensure appropriate re-use of data.
• Semantics can be controlled by:
– Annotating the data model (e.g. in XML schema <xsd:annotation>)
– Links to external sources (e.g. XML dictionaries in CML)
Ontology
• An ontology can be thought of as ‘an explicit specification of concepts and the relationships between them.’
• Relationships between concepts can be expressed using the Resource Description Framework (RDF). RDF is the basis of ontology languages such as OWL and DAML+OIL.
• RDF Schema specifies the relationships used by the RDF and the relationships between relationships…
• An ontology helps to reduce implicit assumptions about data and their relationships.
XML Representation
• XML is a widely adopted and mature method of representing
structured information
• A vast and increasing range of tools makes XML easily readable and
interpretable by applications authored by different groups
• At the expense of conciseness:
– XML is self describing – it carries meta-data
– XML can be explicit about data
• Some methods of representing data in XML already exist (e.g. CML),
for which there are many tools
An Example
A geometry representation for the CH molecule
A basis set representation for the CH molecule
Relationships
• How do we link the basis sets and geometries?
– Could rely on implicit linking (<atom elementType="C"…> with <basisSet id="C1"…>)
• But what happens if we want to change the rules?
– Could use attributes (<atom id="a1" basis="C1"…>)
• But documents could come from different sources, and don’t know about each other’s attributes
• Continual revision of the data model
– Could describe the relationship using RDF, or in an RDF-like manner
• RDF/n3:
@prefix r1: <file://chGeom.xml#xpointer> .
@prefix r2: <file://chBasis.xml#xpointer> .
@prefix r3: <file://eccpRelations.html#> .
<r2:(//basisSet[@id="C1"])> <r3:isBasisFor> <r1:(//atom[@elementType="C"])> .
<r2:(//basisSet[@id="H1"])> <r3:isBasisFor> <r1:(//atom[@elementType="H"])> .
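One way to picture how an application might use relationships that are stored apart from the documents is sketched below. It is only an illustration: the two XML strings are hypothetical stand-ins for chGeom.xml and chBasis.xml, the relation list is a plain Python structure rather than parsed RDF, and the function name is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for chGeom.xml and chBasis.xml.
GEOM_XML = """<molecule id="ch">
  <atom id="a1" elementType="C"/>
  <atom id="a2" elementType="H"/>
</molecule>"""

BASIS_XML = """<basisSets>
  <basisSet id="C1"/>
  <basisSet id="H1"/>
</basisSets>"""

# (path in the basis document, predicate, path in the geometry document)
RELATIONS = [
    (".//basisSet[@id='C1']", "isBasisFor", ".//atom[@elementType='C']"),
    (".//basisSet[@id='H1']", "isBasisFor", ".//atom[@elementType='H']"),
]


def resolve_basis_links(geom_root, basis_root, relations):
    """Return {atom id: basis set id} for every isBasisFor statement."""
    links = {}
    for subject_path, predicate, object_path in relations:
        if predicate != "isBasisFor":
            continue
        for basis in basis_root.findall(subject_path):
            for atom in geom_root.findall(object_path):
                links[atom.get("id")] = basis.get("id")
    return links


links = resolve_basis_links(
    ET.fromstring(GEOM_XML), ET.fromstring(BASIS_XML), RELATIONS
)
print(links)
```

Because the rules live in the relation list, changing which basis set applies to which atoms does not require touching either document.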
Relationships
• But…
• Passing text is not straightforward – can serialise RDF/n3 to RDF/XML
• RDF/XML (converted using CWM)
<rdf:RDF xmlns:r1="file://chGeom.xml#xpointer"
         xmlns:r2="file://chBasis.xml#xpointer"
         xmlns:r3="file://chRelations.html#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="r2:(//basisSet[@id='C1'])">
    <isBasisFor xmlns="r3:"
                rdf:resource="r1:(//atom[@elementType='C'])"/>
  </rdf:Description>
  <rdf:Description rdf:about="r2:(//basisSet[@id='H1'])">
    <isBasisFor xmlns="r3:"
                rdf:resource="r1:(//atom[@elementType='H'])"/>
  </rdf:Description>
</rdf:RDF>
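Once the statements are in XML, a general-purpose XML parser can read them back. The sketch below uses Python's standard library and is not a full RDF parser; the embedded document is a trimmed copy of the slide's example with single quotes inside the attribute values so that it is well-formed, and `read_triples` is a name chosen for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="r2:(//basisSet[@id='C1'])">
    <isBasisFor xmlns="r3:" rdf:resource="r1:(//atom[@elementType='C'])"/>
  </rdf:Description>
  <rdf:Description rdf:about="r2:(//basisSet[@id='H1'])">
    <isBasisFor xmlns="r3:" rdf:resource="r1:(//atom[@elementType='H'])"/>
  </rdf:Description>
</rdf:RDF>"""


def read_triples(rdf_text):
    """Return (subject, predicate, object) tuples from simple rdf:Description blocks."""
    triples = []
    root = ET.fromstring(rdf_text)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # Drop the "{namespace}" prefix that ElementTree adds to tag names.
            predicate = prop.tag.split("}")[-1]
            obj = prop.get(f"{{{RDF_NS}}}resource")
            triples.append((subject, predicate, obj))
    return triples


for triple in read_triples(RDF_XML):
    print(triple)
```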
Other Design Considerations
• How implicit/explicit should we be?
• When should we use ‘general’ or ‘grouping’ tags for data?
– E.g. to take a CML-like example:
Exchange energy = -1.025771783 or
<eexchange>-1.025771783</eexchange> or
<scalar dictRef="eccp:eexchange">-1.025771783</scalar>
• To what extent should we tag data?
– E.g.
<basisExponents>0.0 0.0 0.0 0.0</basisExponents> or
<basisExponents> <n>0.0</n> <n>0.0</n> <n>0.0</n> <n>0.0</n>
</basisExponents>
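The tagging trade-off can be seen from the reader's side. The sketch below, which assumes the two `basisExponents` layouts shown above and uses an illustrative function name, accepts either form: the compact form leaves the splitting of values to the application, while the fully tagged form lets the XML parser do it.

```python
import xml.etree.ElementTree as ET

COMPACT = "<basisExponents>0.0 0.0 0.0 0.0</basisExponents>"
TAGGED = (
    "<basisExponents><n>0.0</n><n>0.0</n>"
    "<n>0.0</n><n>0.0</n></basisExponents>"
)


def read_exponents(xml_text):
    """Return the exponents as floats from either layout."""
    element = ET.fromstring(xml_text)
    children = element.findall("n")
    if children:
        # Fully tagged: one value per child element.
        return [float(child.text) for child in children]
    # Compact: whitespace-separated values in the element text.
    return [float(token) for token in (element.text or "").split()]


print(read_exponents(COMPACT))
print(read_exponents(TAGGED))
```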
Using XML
[Diagram labels: doc1.xml, doc2.xml, doc3.xml; Parser; in.txt; Application; native/foreign XML libraries (SAX, DOM); out.txt; Meta-data; Parser; out.xml]
• Reading:
– Convert to standard application input (e.g. use XSLT)
– Read in the XML directly by using existing, or writing your own, native/foreign code
• Writing:
– Parse the standard application output and convert to XML
– Write the XML directly by using existing, or writing your own, native/foreign code
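The "parse the standard output and convert to XML" route might look roughly like the following. This is a sketch under stated assumptions: the output line format is the "Exchange energy = …" form used earlier in the slides, the `results` and `metadata` element names and the label-to-dictionary mapping are hypothetical, and the code name passed in is a placeholder.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from labels in the text output to dictionary references.
LABEL_TO_DICTREF = {"Exchange energy": "eccp:eexchange"}

LINE_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z ]+?)\s*=\s*(?P<value>[-+]?\d+\.\d+)\s*$"
)


def output_to_xml(text, code_name):
    """Convert recognised 'label = value' lines into scalar elements."""
    root = ET.Element("results")  # hypothetical container element
    meta = ET.SubElement(root, "metadata")  # hypothetical meta-data element
    meta.set("code", code_name)
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match and match.group("label") in LABEL_TO_DICTREF:
            scalar = ET.SubElement(root, "scalar")
            scalar.set("dictRef", LABEL_TO_DICTREF[match.group("label")])
            scalar.text = match.group("value")
    return ET.tostring(root, encoding="unicode")


print(output_to_xml("Exchange energy = -1.025771783\n", "exampleCode"))
```

A parser like this depends on the exact wording of the output, so it has to be revisited whenever the application's output format changes.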
Using XML - Comments
• Comments:
– Careful choice of DOM or SAX parser implementation
• DOM – potentially large overheads when used with large model instances
• SAX – difficult to code when data is heavily cross-referenced
– Until recently, XML support for FORTRAN has been poor: no existing native parsers, and it’s difficult to write your own
• Solutions
– native FORTRAN XML modules (Alberto Garcia)
– FORTRAN DOM (Jon Wakelin)
– XML libraries such as libXML and Xerces could be used with appropriate wrappers
– There are mixed SAX and DOM API implementations, e.g. libXML xmlTextReader
– Parsing standard output is a good option for proprietary code, but is fragile when the output changes between code versions
– Writing formatted data directly is error prone
• FORTRAN WXML module (Alberto Garcia)
• FORTRAN CML writer (Jon Wakelin)
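The DOM versus streaming trade-off can be illustrated with Python's standard library. This is an analogue chosen for brevity, not the libXML xmlTextReader API itself: `ET.parse` builds the whole tree, whereas `ET.iterparse` hands over elements as they complete. The sample document and function names are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_atoms_tree(source):
    """DOM-like approach: build the whole tree in memory, then query it."""
    return len(ET.parse(source).getroot().findall(".//atom"))


def count_atoms_streaming(source):
    """Pull-style approach: handle each element as it completes, then empty it."""
    count = 0
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "atom":
            count += 1
        element.clear()  # drop the element's content to limit memory growth
    return count


SAMPLE = b"<molecule><atom elementType='C'/><atom elementType='H'/></molecule>"
print(count_atoms_tree(io.BytesIO(SAMPLE)))
print(count_atoms_streaming(io.BytesIO(SAMPLE)))
```

The tree version is easier to write when elements refer to one another; the streaming version keeps memory use closer to constant for large files.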
Automation
• Data models evolve with time. It is hard work to maintain code by
hand. Ideally…
[Diagram labels: schema, XML, API generator, API, “?”, wrapper objects, application]
• CML - Java and C++ API generators
• CCPN – Python API generators
• Still have to worry about mapping the wrapper data structures to
the internal data structures of the application.
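The remaining mapping step could be as small as the sketch below. `AtomWrapper` is a hand-written, hypothetical stand-in for a class that an API generator would produce, the flat arrays stand for whatever layout an application uses internally, and the coordinates are illustrative values rather than a reference geometry.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AtomWrapper:
    """Hypothetical stand-in for a class produced by an API generator."""
    element_type: str
    x: float
    y: float
    z: float


def to_internal_arrays(atoms: List[AtomWrapper]) -> Tuple[List[str], List[float]]:
    """Map wrapper objects onto flat arrays, as an array-based code might store them."""
    symbols = [atom.element_type for atom in atoms]
    coordinates = [value for atom in atoms for value in (atom.x, atom.y, atom.z)]
    return symbols, coordinates


# Illustrative coordinates only.
symbols, coordinates = to_internal_arrays(
    [AtomWrapper("C", 0.0, 0.0, 0.0), AtomWrapper("H", 0.0, 0.0, 1.1)]
)
print(symbols, coordinates)
```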
Data Modelling
• The focus is back to the data model.
• XSD is not easy to interpret, impeding a collaborative approach to
data model design
• Designing is complicated by implementation decisions – it is a good
idea to separate the conceptualisation and implementation
• Represent the data model in the Unified Modelling Language (UML)?
• This is a graphical notation (mainly) for expressing designs.
• Can UML express XSD implementation decisions?
– Yes, through UML stereotypes (subtypes of Meta-model types)
– A UML profile (collection of stereotypes) for schema design has
been developed by David Carlson
UML data model
• UML equivalent to the XSD geometry and basis set data model
• UML can be represented as XMI to facilitate the communication of data models between applications.
• Hypermodel will convert XMI to XSD.
Binary Data
• Some scientific data would be best stored in binary (e.g. molecular
orbitals)
• Binary data could simply be pointed to by XML
• But…
– Sharing binary data requires a machine independent way of
storing it.
• Could use:
– HDF
– NetCDF
– BinX/DFDL
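What "machine independent" means in practice can be shown with a hand-rolled sketch that fixes the byte order and floating-point width explicitly. It does not use HDF, NetCDF, or BinX; the `binaryData` element, its attribute names, and the file name are hypothetical.

```python
import struct
import xml.etree.ElementTree as ET


def pack_coefficients(values):
    """Encode floats as big-endian IEEE 754 doubles, whatever the host byte order."""
    return struct.pack(f">{len(values)}d", *values)


def unpack_coefficients(payload):
    """Decode a payload written by pack_coefficients."""
    count = len(payload) // 8
    return list(struct.unpack(f">{count}d", payload))


def describe_payload(href, count):
    """Build a hypothetical XML element that points at the binary file."""
    element = ET.Element("binaryData")
    element.set("href", href)
    element.set("dataType", "float64")
    element.set("byteOrder", "bigEndian")
    element.set("count", str(count))
    return ET.tostring(element, encoding="unicode")


payload = pack_coefficients([0.5, -0.25, 1.0])
print(unpack_coefficients(payload))
print(describe_payload("orbitals.bin", 3))
```

The listed formats record this kind of layout information for the user, which is the main reason to prefer them over an ad hoc encoding.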
Current Status
• Drafting CML-like markup and schema for some computational chemistry data
– Basis sets
– Molecular orbitals
– Cartesian and internal coordinates
– Molecular vibrations
– Job parameters
– Scalar quantities
• Set up an eCCP1 Wiki for discussions (grids.ac.uk/eccp)
• Set up a NeSCForge project page for code/data model development
Current Status
• Developing a C API for parsing CML geometries
– Linked to libXML2
– Designed to scale well with XML file size (uses xmlTextReader)
– Designed to be easily FORTRAN callable
– Transparently reads gzipped XML files
• Python module written to read in CML1/2 molecular atom information for the CCP1 GUI
• GROWL (Grid Resources on Workstation Language), a C API for utilising current CLRC Grid portal services, is being developed.
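The behaviour described for the geometry reader (streaming, with transparent handling of gzipped files) can be imitated in Python as follows. This is not the C API itself: the function names are invented, the sample document is a hypothetical un-namespaced geometry, and gzip detection by magic bytes is one possible approach.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"


def open_xml(path):
    """Open a file for reading, decompressing on the fly if it is gzipped."""
    with open(path, "rb") as handle:
        is_gzipped = handle.read(2) == GZIP_MAGIC
    return gzip.open(path, "rb") if is_gzipped else open(path, "rb")


def read_element_types(path):
    """Stream through the file and collect the elementType of each atom."""
    element_types = []
    with open_xml(path) as stream:
        for _event, element in ET.iterparse(stream, events=("end",)):
            if element.tag == "atom":
                element_types.append(element.get("elementType"))
            element.clear()
    return element_types


# Small demonstration with a temporary gzipped file.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "geom.xml.gz")
    with gzip.open(demo_path, "wb") as out:
        out.write(b"<molecule><atom elementType='C'/><atom elementType='H'/></molecule>")
    print(read_element_types(demo_path))
```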
Meeting Logistics
Agenda
Monday 5th April
Time          | Format                              | Location
10.00 – 11.10 | Presentations                       | Lecture theatre
11.10 – 11.25 | Refreshments                        |
11.25 – 12.35 | Presentations                       | Lecture theatre
12.35 – 13.35 | Lunch                               |
13.35 – 14.10 | Presentations                       | Lecture theatre
14.10 – 15.40 | Practical session and announcements | Lecture theatre, training lab, Cramond
15.40 – 16.00 | Refreshments                        |
16.00 – 17.30 | Practical session                   | Lecture theatre, training lab, Cramond
19.00         | Conference dinner                   |
Tuesday 6th April
Time          | Format                              | Location
09.00 – 10.10 | Presentations                       | Lecture theatre
10.10 – 10.25 | Refreshments                        |
10.25 – 12.10 | Presentations                       | Lecture theatre
12.10 – 13.10 | Lunch                               |
13.10 – 14.25 | Open discussions                    | Cramond
14.25 – 14.40 | Refreshments                        |
14.40 – 16.00 | Open discussions                    | Cramond
16.00         | Meeting close                       |
Publication of meeting material
1. Contributed material to be published independently (e.g. NeSC technical report on CML)
– Meeting proceedings (summary) + presentations on web
– Create a meeting CD containing presentations and proceedings
2. Contributed material to be published independently
– Meeting proceedings (technical report written by authors of contributed material), focus on existing material and decisions on way forward.
– Create a meeting CD: presentations, proceedings, and code?
Discussion Topics
• How to construct a working group
– Who could be involved?
• Process of data model refinement
• Reference Implementation
– Platforms?
– What are the requirements?
– Manpower?
• How focused should we be (data types to include, platforms, etc.)?
• How do we reach a consensus?