Markup in Atomic and Molecular Simulations: Implementation & Issues Jon Wakelin

Dept. Earth Sciences
University of Cambridge
Overview
Background – The problem
A solution
An Implementation
Demo
Summary

Background (1)

The computational chemistry and physics communities have more data than ever before
  Advances in computer power
  Access to HPC facilities
  Algorithmic & scientific advances
  Better exploitation of existing facilities (Grid)
  High throughput computing (Condor)
The same factors have led to qualitative changes in data
  Can now attempt new kinds of calculations
Background (2)

The majority of this data is reused
  Starts in a database
  Passed into a program
  Post-processed
  Visualized, etc…
Most notably…
  Structures/Coordinates
  Forcefields/Interatomic potentials
  Basis Sets
  Pseudopotentials
Background (3)

So the nature of our data has changed but the way we deal with it has not
  Still rely on bespoke text and binary formats
  Issues such as interoperability, data management and data reuse are tackled in an informal or ad hoc manner
  Binary markup languages (NetCDF, HDF)
A solution

XML
  Allows the user to describe data of arbitrary structure
  Or… allows the user to structure his/her data arbitrarily
  Provides us with a known format (i.e. it is easy to parse)
  Many free tools and standards
  ~7 years old, so fairly well road tested
CML (Chemical Markup Language)
  Extensions to CML core for simulations - CMLComp
  CML is not tied to a particular chemistry or physics program
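As a rough illustration of what such markup looks like, the sketch below builds a small CML-style description of a water molecule using only Python's standard library. The element and attribute names (molecule, atomArray, atom, elementType, x3/y3/z3) follow CML core conventions; the coordinates and the water_cml helper are made up for illustration, and the CMLComp extensions are not shown.

```python
import xml.etree.ElementTree as ET

# Illustrative coordinates in Angstrom; not taken from a real calculation.
ATOMS = [
    ("a1", "O", 0.000, 0.000, 0.000),
    ("a2", "H", 0.757, 0.586, 0.000),
    ("a3", "H", -0.757, 0.586, 0.000),
]


def water_cml():
    """Return a CML-style <molecule> element (hypothetical helper)."""
    molecule = ET.Element("molecule", {"id": "h2o"})
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element, x, y, z in ATOMS:
        ET.SubElement(atom_array, "atom", {
            "id": atom_id,
            "elementType": element,
            "x3": f"{x:.3f}",
            "y3": f"{y:.3f}",
            "z3": f"{z:.3f}",
        })
    return molecule


# Serialise to text; nothing here depends on which program reads it next.
cml_text = ET.tostring(water_cml(), encoding="unicode")
print(cml_text)
```

Any XML-aware tool can read the result back without knowing which program wrote it.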
What will markup do for us?

Facilitate data exchange
  Between chemistry and physics software, but also…
  Easier to extract data to databases
  Facilitates other tasks such as data-mining
Make data producers more accountable
  Schemas and related technologies
  Dictionaries
Reduce software development (eventually)
  No need to support multiple formats
  No need to write ‘converters’
  Standard libraries for processing CML
Data Exchange (examples)








Equilibrate MD in DLPOLY then continue in SIESTA
Visualize output from Gaussian in Jmol
Compare timings between VASP and CASTEP
Take structure from ICSD and relax in SIESTA
Develop forcefield in GULP, then use it in DLPOLY
Calculate property X in Dalton and property Y in GAMESS
And so on… These examples should be familiar to us all, but they are essentially trivial. However…
Grid/Condor facilitate high-throughput computing
  Often want to create complex workflow schemes
  E.g. using Condor’s DAG Manager
  But there is no prescription for how to handle the data as it ‘flows’
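[Figure: data flow through a simulation code, with and without markup]
Unlabelled in the figure: In.txt -> CODE -> Out.txt
Design 1: In.xml -> Parser -> In.txt -> CODE -> Out.txt -> Parser -> Out.xml
Design 2: In.xml -> Parser -> In.txt -> CODE -> Out.xml
Design 3: In.xml -> CODE -> Out.xml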
Design 1





Only option when you don’t have access to source code
Input: XSL or program using SAX, DOM
Output: JumboMarker
Programs using this design: MOPAC, Gaussian
Pros & Cons
+ Generality – it will work for any code!
+ Don't need access to the source code
- Requires more user intervention
  Parsing text to create XML!
  Need to know all combinations that the code can throw at you
  Is at the mercy of changes to the output by the code developers.
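A minimal sketch of the Design 1 output step, assuming a made-up text layout: the code's text output is scanned with a regular expression and the matches are wrapped as CML-style atoms. The sample output, the ATOM_LINE pattern and the text_to_cml name are all hypothetical; this is not how JumboMarker works, only an illustration of why the approach is fragile.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical text output; every real program has its own layout.
SAMPLE_OUTPUT = """\
 Final coordinates (Angstrom)
   O    0.000   0.000   0.000
   H    0.757   0.586   0.000
   H   -0.757   0.586   0.000
 Total energy = -76.4 eV
"""

# Matches "symbol x y z" lines only; anything else is ignored.
ATOM_LINE = re.compile(
    r"^\s*([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)


def text_to_cml(text):
    """Wrap coordinate lines found in text as a CML-style <molecule>."""
    molecule = ET.Element("molecule")
    atom_array = ET.SubElement(molecule, "atomArray")
    for line in text.splitlines():
        match = ATOM_LINE.match(line)
        if match is None:
            # Unrecognised lines are skipped silently. If the developers
            # change the output layout, data is lost here without warning.
            continue
        element, x, y, z = match.groups()
        index = len(atom_array) + 1
        ET.SubElement(atom_array, "atom", {
            "id": f"a{index}",
            "elementType": element,
            "x3": x, "y3": y, "z3": z,
        })
    return molecule


cml = text_to_cml(SAMPLE_OUTPUT)
print(ET.tostring(cml, encoding="unicode"))
```

The pattern only covers the one layout shown. Every additional layout the code can produce needs its own handling, which is the cost listed under the cons above.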
Design 2

When you have access to the source code
When you are using Fortran
Input: XSL, program with SAX, DOM
Output: Jumbo90, WXML
Examples: SIESTA, GULP, DLPOLY
Pros and Cons
+ Avoid tricky text => XML conversion
+ Only have to maintain a single program
+ Simpler from point of view of end user
- End user still has to convert CML => text
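The remaining CML => text step in Design 2 might look something like the sketch below, which reads atoms from a CML-style document and writes a plain-text input. The text layout ("natoms" header followed by one atom per line) and the cml_to_text name are invented for illustration; a real code would dictate its own input format, and an XSL stylesheet could do the same job.

```python
import xml.etree.ElementTree as ET

CML_INPUT = """<molecule id="h2o">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.000"/>
    <atom id="a2" elementType="H" x3="0.757" y3="0.586" z3="0.000"/>
    <atom id="a3" elementType="H" x3="-0.757" y3="0.586" z3="0.000"/>
  </atomArray>
</molecule>"""


def cml_to_text(cml):
    """Convert CML-style atoms to a hypothetical plain-text input."""
    root = ET.fromstring(cml)
    atoms = root.findall("./atomArray/atom")
    lines = [f"natoms {len(atoms)}"]
    for atom in atoms:
        # Coordinates are stored as attribute strings; convert to float
        # so the text file can use a fixed-width numeric format.
        x, y, z = (float(atom.get(key)) for key in ("x3", "y3", "z3"))
        symbol = atom.get("elementType")
        lines.append(f"{symbol:<2s} {x:10.4f} {y:10.4f} {z:10.4f}")
    return "\n".join(lines) + "\n"


text_input = cml_to_text(CML_INPUT)
print(text_input)
```

This direction is much easier than text => XML because the converter queries a known structure instead of guessing at a layout.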
Design 3




When you have access to the source code
Input/Output: DOM
Examples: Jmol, JChemEdit, openBabel
Pros and Cons
+ Simplest for end user
- Most Chem/Phys programs still written in Fortran
  Limited XML support for Fortran
- CML is not the file format of your program
  A CML file is not guaranteed to contain all the info you need
  Alternatively it may contain too much
  “Towards a common data and command …”
Implementation - Output

An F90 library for creating well-formed XML
  WXML (A. Garcia)
An F90 library for formatting CML
  Jumbo90
  Provides convenience routines for creating CML elements
  Has been used in SIESTA, GULP, DLPOLY
We should look to auto-generate these libraries
But output is the easy part...
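The two-layer idea (a generic well-formed-XML writer with CML convenience routines on top) can be sketched as follows. This is a Python analogue, not the WXML or Jumbo90 API: the XmlWriter class, its method names and the add_scalar helper are hypothetical, and only a subset of well-formedness rules (proper nesting, escaping) is enforced.

```python
from xml.sax.saxutils import escape, quoteattr


class XmlWriter:
    """Tiny streaming writer that refuses to produce badly nested XML."""

    def __init__(self):
        self._parts = []
        self._open = []  # stack of currently open element names

    def start(self, name, **attrs):
        attr_text = "".join(
            f" {key}={quoteattr(str(value))}" for key, value in attrs.items()
        )
        self._parts.append(f"<{name}{attr_text}>")
        self._open.append(name)

    def text(self, value):
        if not self._open:
            raise ValueError("text outside the root element")
        self._parts.append(escape(str(value)))

    def end(self, name):
        # Only the most recently opened element may be closed.
        if not self._open or self._open[-1] != name:
            raise ValueError(f"cannot close <{name}> here")
        self._open.pop()
        self._parts.append(f"</{name}>")

    def result(self):
        if self._open:
            raise ValueError(f"unclosed elements: {self._open}")
        return "".join(self._parts)


def add_scalar(writer, title, value, units):
    """Convenience routine layered on the writer (hypothetical name)."""
    writer.start("property", title=title)
    writer.start("scalar", units=units)
    writer.text(value)
    writer.end("scalar")
    writer.end("property")


w = XmlWriter()
w.start("cml")
add_scalar(w, "Total Energy", -76.4, "eV")  # illustrative value and units
w.end("cml")
document = w.result()
print(document)
```

The simulation code only calls routines like add_scalar, so it never assembles tags by hand and cannot emit mismatched elements.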
Implementation - Input


Could link to libxml2 (C Library)
Could implement SAX or DOM in Fortran
  Several groups have tried this
  A. Garcia has an F90 SAX parser
  We have built an F95 DOM parser on top of this
  Currently supports DOM 1.0
Could we go one step further?
  Could we implement a CML-DOM in Fortran?
  Generic W3C DOM vs. language-specific DOM
  E.g. MathML-DOM, SVG-DOM, CML-DOM
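Building a tree on top of a SAX parser can be sketched as below, using Python's xml.sax in place of the F90 parser. The event handler keeps a stack of open elements and attaches each new node to the element on top. The Node and TreeBuilder classes are hypothetical and much simpler than a DOM: text is stored on the element itself rather than in separate text nodes.

```python
import xml.sax


class Node:
    """Simplified tree node; not a W3C DOM node."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs
        self.children = []
        self.text = ""


class TreeBuilder(xml.sax.ContentHandler):
    """Turn the SAX event stream into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # elements that have started but not yet ended

    def startElement(self, name, attrs):
        node = Node(name, dict(attrs.items()))
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)

    def characters(self, content):
        # May be called several times for one run of text, so append.
        if self._stack:
            self._stack[-1].text += content

    def endElement(self, name):
        self._stack.pop()


def parse_to_tree(xml_text):
    handler = TreeBuilder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.root


tree = parse_to_tree(
    "<person><name><fst>Jon</fst><sec>Smith</sec></name>"
    "<stats><height>20</height><weight>60</weight></stats></person>"
)
print(tree.name, [child.name for child in tree.children])
```

The SAX layer does the tokenising and well-formedness checking; the tree layer only needs the three callbacks shown.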
XML as a tree
<person>
<name>
<fst>Jon</fst>
<sec>Smith</sec>
</name>
<stats>
<height>20</height>
<weight>60</weight>
</stats>
</person>
Person
  Name
    Fst
      Jon
    Sec
      Smith
  Stats
    Height
      20
    Weight
      60
Generic DOM Tree
[Figure: the same document drawn as a generic DOM tree. Every node is either an Element (labelled in the figure: person, stats, weight) or a Text node (labelled: 60).]
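The generic DOM view of the person document can be reproduced with Python's xml.dom.minidom, which implements the W3C DOM interfaces. The sketch below walks the tree and labels each node by its generic type, as in the figure. The describe function is an illustrative helper, and the document is written on one line so that indentation whitespace does not show up as extra Text nodes.

```python
from xml.dom import Node, minidom

PERSON_XML = (
    "<person><name><fst>Jon</fst><sec>Smith</sec></name>"
    "<stats><height>20</height><weight>60</weight></stats></person>"
)


def describe(node, depth=0, out=None):
    """Collect one indented line per Element or Text node."""
    out = [] if out is None else out
    if node.nodeType == Node.ELEMENT_NODE:
        out.append("  " * depth + f"Element = {node.tagName}")
    elif node.nodeType == Node.TEXT_NODE:
        out.append("  " * depth + f"Text = {node.data}")
    for child in node.childNodes:
        describe(child, depth + 1, out)
    return out


doc = minidom.parseString(PERSON_XML)
lines = describe(doc.documentElement)
print("\n".join(lines))
```

The walker never mentions person, stats or weight: the generic DOM knows only node types. A language-specific DOM would instead expose those names directly.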
Generic DOM

Implementation in F95
  Inheritance vs. flattened view
  Similarities with C’s libxml implementation
  Using linked lists/pointers
  Functions return pointers to data structures
Things to do
  Remember to use pointer syntax!!!!
  No validation
  No XPath
  No 16-bit strings
Benefits
  Portable
  Live nodes
Demo




siesta.xml – H2O
siesta.html – H2O
siesta.html – Pyrophyllite
gulp.html – Al/Cu cluster
Summary

Began with three observations:
  Quantitative and qualitative changes in our data
  Data exchange is essential (even in the simplest calculation)
  Bespoke data formats and ad hoc solutions for data exchange
Changing the way we deal with data will:
  Facilitate data exchange and interoperability
  Make data and data producers more accountable
  Reduce code development (but not yet)
Implementation



Design depends on: access to source, programming language
Output – Jumbo90/WXML
Input – F90 implementations of SAX/DOM/CML-DOM
Acknowledgments
P. Murray-Rust & A. Garcia
NERC
