Motivation
Vocabulary
wrappers
Looking for a (standard) Common Format for Computational (Quantum) Chemistry
A WG activity within COST Action D23 (WG D23/0006/01)
– Elda Rossi, Andrew Emerson - CINECA
– Gian Luigi Bendazzoli, Antonio Monari - Università di Bologna
– Renzo Cimiraglia, Celestino Angeli, Stefano Borini - Università di Ferrara
– Daniel Maynau, Stefano Evangelisti - IRSAMC, Toulouse
– José Sanchez-Marin - Universitat de Valencia
– Peter Szalay - Eötvös Loránd University
– Rosa Caballol - Universitat Rovira i Virgili, Tarragona
Motivation for the work
To build a meta-system for supporting research collaboration in the field of
“Localised Orbitals in post-SCF methods … Linear Scaling methods in a Multi-Reference context”
The scenario
• Different laboratories need to collaborate
• Different “home-made” codes need to be used together, since they give different views of the same problem
• General-purpose “basic” codes are needed to pre-compute data in a sort of pipeline
• Programs should remain on their original sites, under the responsibility of their authors
[Diagram labels: different platforms; network connections (grid architecture); workflow]
The need for a Common Format
The first problem we faced: how can different codes (on different platforms) communicate?
We need a Common Format for (at least) Quantum Chemistry codes
Preliminary steps
• Looking around …
o CML has been available for a long time
o XML is used by Accelrys for internal files
o XML is used by ArgusLab for internal files
None of them is completely suited to computational chemistry: they cover mainly structural chemistry, with no Quantum Chemistry properties
• XML seems the best technology, so we decided to try another XML-based format
• HDF5 looked nice for storing the large binary data typical of QC
How the engine should work
• Leaves the program unchanged
• One wrapper for each program: if a code is added, only one wrapper needs to be written
[Diagram labels: Data Repository (XML/HDF), IN-wrapper, IN-files, Program, OUT-files, OUT-wrapper]
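The wrapper idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the repository layout, the plain-text input format, the `energy = …` output line and the `fake_program` stand-in are all hypothetical and do not correspond to any real code in the metasystem.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository content; element and attribute names are illustrative.
REPOSITORY_XML = """
<system title="H2 test">
  <molecule charge="0" spinMultiplicity="1">
    <geometry unit="angstrom">
      <atom symbol="H" x3="0.0" y3="0.0" z3="0.0"/>
      <atom symbol="H" x3="0.0" y3="0.0" z3="0.74"/>
    </geometry>
  </molecule>
</system>
"""

def in_wrapper(repo):
    """Translate repository data into the program's native input text."""
    mol = repo.find("molecule")
    lines = [f"charge {mol.get('charge')}",
             f"multiplicity {mol.get('spinMultiplicity')}"]
    for atom in mol.iter("atom"):
        lines.append(f"{atom.get('symbol')} {atom.get('x3')} "
                     f"{atom.get('y3')} {atom.get('z3')}")
    return "\n".join(lines) + "\n"

def fake_program(input_text):
    """Stand-in for an unchanged legacy code with its own input and output format."""
    n_atoms = sum(1 for line in input_text.splitlines() if line.split()[0] == "H")
    # The energy below is a placeholder, not a computed result.
    return f"atoms read: {n_atoms}\nenergy = -1.0\n"

def out_wrapper(output_text, repo):
    """Parse the program's native output and store the result in the repository."""
    computed = repo.find("computedData")
    if computed is None:
        computed = ET.SubElement(repo, "computedData")
    for line in output_text.splitlines():
        if line.startswith("energy ="):
            value = line.split("=", 1)[1].strip()
            ET.SubElement(computed, "energy", unit="hartree", value=value)
    return repo

repo = ET.fromstring(REPOSITORY_XML)
native_input = in_wrapper(repo)             # repository -> IN-file
native_output = fake_program(native_input)  # the program itself is untouched
repo = out_wrapper(native_output, repo)     # OUT-file -> repository
print(ET.tostring(repo, encoding="unicode"))
```

Only the two wrapper functions know about the program's native formats, so adding another code to the pipeline would mean writing another such pair and nothing else.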
QCML: an XML format for QC
• In order to be as general as possible, we need to write down a hierarchical schema of Quantum Chemistry quantities
• As a first approximation, three domains can be identified:
o Base FACTS: initial data for describing the physics of the system
o DERIVED: quantities computed from FACTS using QC algorithms (energies, properties, integrals, coefficients, …)
o W-FLOW: which codes are in the pipeline, specific input data, …
Fact
Parameters
• A base fact is a fact that is a given in the world and is remembered (stored) in the system.
• A derived fact is created by an inference or a mathematical calculation from terms, facts, other derivations, or even action assertions.
FACT: molecule
<system title date program author>
  <molecule nElectrons charge spinMultiplicity spaceSymmetry>
    <symmetry groupName/>
    <geometry type unit numAtoms symmetryRef>
      <atom symbol isotope x3 y3 z3/>
    <basis name type numOrbitals>
      <atomBase angularMomMAX symbol>
        <angularMom value symbol numOrbitals>
          <orbital id numPrimitives>
            <exps/> <coeffs/>
–FACTS
–DERIVED
–W-FLOW
Symmetry: group name & other symmetry data
Geometry: only Cartesian, full or symmetry-unique
Basis: by name or fully defined
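The outline above gives element and attribute names but no values. A minimal sketch of a filled-in instance, built with Python's standard `xml.etree.ElementTree`, might look like the following. The water geometry, the attribute values (`"cartesian"`, `"angstrom"`, `"byName"`, `"C2v"`) and the nesting of `basis` under `molecule` are illustrative assumptions, not taken from the QCML schema.

```python
import xml.etree.ElementTree as ET

def build_molecule():
    """Build a small FACTS document; all values are illustrative only."""
    system = ET.Element("system", title="water example", author="example")
    molecule = ET.SubElement(system, "molecule", nElectrons="10", charge="0",
                             spinMultiplicity="1", spaceSymmetry="A1")
    ET.SubElement(molecule, "symmetry", groupName="C2v")

    # Approximate Cartesian coordinates, used here only as sample data.
    atoms = [("O", 0.0, 0.0, 0.0),
             ("H", 0.0, 0.757, 0.587),
             ("H", 0.0, -0.757, 0.587)]
    geometry = ET.SubElement(molecule, "geometry", type="cartesian",
                             unit="angstrom", numAtoms=str(len(atoms)))
    for symbol, x, y, z in atoms:
        ET.SubElement(geometry, "atom", symbol=symbol,
                      x3=str(x), y3=str(y), z3=str(z))

    # Basis given "by name" rather than fully defined.
    ET.SubElement(molecule, "basis", name="STO-3G", type="byName")
    return system

system = build_molecule()
ET.indent(system)  # pretty-printing helper, available from Python 3.9
print(ET.tostring(system, encoding="unicode"))
```

Giving the basis by name keeps the document short; the fully defined variant would add the `atomBase`, `angularMom` and `orbital` levels with their exponents and coefficients.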
DERIVED data: computedData
<system …>
  <computedData>
    <energy unit levelOfTheory quality value>
      <state spaceSymmetry spinMultiplicity excitationLevel />
    <property unit levelOfTheory quality value>
      <state “bra” spaceSymmetry spinMultiplicity excitationLevel />
      <state “ket” spaceSymmetry spinMultiplicity excitationLevel />
      <operator order name/>
    <file address URL/>
A “schema” has been written for QCML
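As an illustration of how a wrapper might read DERIVED data back, the sketch below filters `energy` entries by level of theory. The sample document, its attribute values and the `energies` helper are hypothetical; the numbers are placeholders rather than computed results.

```python
import xml.etree.ElementTree as ET

# Hypothetical computedData block; values are placeholders.
QCML_TEXT = """
<system title="example">
  <computedData>
    <energy unit="hartree" levelOfTheory="SCF" quality="placeholder" value="-1.0">
      <state spaceSymmetry="A1" spinMultiplicity="1" excitationLevel="0"/>
    </energy>
    <energy unit="hartree" levelOfTheory="FCI" quality="placeholder" value="-2.0">
      <state spaceSymmetry="A1" spinMultiplicity="1" excitationLevel="0"/>
    </energy>
  </computedData>
</system>
"""

def energies(root, level_of_theory=None):
    """Return the stored energies, optionally restricted to one level of theory."""
    results = []
    for energy in root.iterfind("computedData/energy"):
        if level_of_theory and energy.get("levelOfTheory") != level_of_theory:
            continue
        state = energy.find("state")
        results.append({
            "level": energy.get("levelOfTheory"),
            "value": float(energy.get("value")),
            "unit": energy.get("unit"),
            "multiplicity": int(state.get("spinMultiplicity")),
        })
    return results

root = ET.fromstring(QCML_TEXT)
fci = energies(root, "FCI")
print(fci)
```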
DERIVED: computedData/file
Problem with large binary datasets
• Include the reference, not the actual data
Two possible strategies:
1. Leave the data in their native format and translate them only when needed. This means maintaining different versions (formats) of the same data
2. Define a “standard” format for binary data and convert them in any case
The second was the solution of choice
• HDF5 appears to be a good solution
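The "reference, not the data" idea can be sketched as below. To keep the example dependency-free, a raw binary file written with Python's standard `array` module stands in for the HDF5 file; the file name `integrals.bin` and the two helper functions are hypothetical.

```python
import array
import os
import tempfile
import xml.etree.ElementTree as ET

def store_by_reference(computed, values, directory):
    """Write the numbers to a side file and record only its address in the XML."""
    path = os.path.join(directory, "integrals.bin")
    with open(path, "wb") as handle:
        array.array("d", values).tofile(handle)
    ET.SubElement(computed, "file", address=path)
    return path

def load_referenced(computed):
    """Follow the <file> reference and read the numbers back on demand."""
    path = computed.find("file").get("address")
    data = array.array("d")
    with open(path, "rb") as handle:
        data.frombytes(handle.read())
    return list(data)

computed = ET.Element("computedData")
with tempfile.TemporaryDirectory() as tmp:
    store_by_reference(computed, [0.5, 1.25, -2.0], tmp)
    xml_text = ET.tostring(computed, encoding="unicode")  # holds the address only
    restored = load_referenced(computed)
print(xml_text)
print(restored)
```

The XML stays small and readable whatever the size of the array, which is the point of the second strategy; in the real system the side file would be in the common binary format rather than a raw dump.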
HDF Mission
To develop, promote, deploy, and support open and free technologies that facilitate scientific data storage, exchange, access, analysis and discovery.
• Format and software for scientific data
• Stores images, multidimensional arrays, tables, etc.
• Emphasis on storage and I/O efficiency
• Free and commercial software support
• Emphasis on standards
• Users from many engineering and scientific fields
Example HDF5 file
[Diagram: tree of an example HDF5 file. Group labels: “/” (root), “/AO”, “/MO”, “/mono”, “/bi”, “/coefficients”. Dataset labels: Overlap, Kinetic, Kinetic+Repulsion, Repulsion (4-D array), Property. Orbital table:]
Orb | occ | energy
----|-----|-------
1   | 0   | 0.35
2   | 0.5 | 0.26
3   | 2.  | 0.69
HDF file structure for QC
Root
  Name
  QCML_ref
• AO
  Norb
  <i/j>
  <i/T/j>
  <i/Vnuc/j>
  <i/T/j>+<i/Vnuc/j>
  <ij/kl>
• MO
  Norb
  Spin Polar.: a=b, a, b
  Orb Classif: Core, Active, Virtual
  Orb Energies
  Orb Symm: [1-order]
  <i/T/j>
  <i/V/j>
  <i/T/j>+<i/Vnuc/j>
  <ij/kl>
  coeff(i,j)
• Property
  <i/p/j>
+ format metadata (integer, binary, Endian-ism, …)
QCML processing: wrappers
• One pair of wrappers for each code in the metasystem
• They should be written & maintained by the authors of the chemical codes
• XML processing can be done through DOM, but … in what language?
o Fortran: no easy and stable DOM available
o Scripting languages (Perl/Python/Java): not known by chemists
• We tried both ways (Fortran & Python)
Fortran DOM: drawbacks
• The only problem is the Fortran binding
o It doesn’t exist (at least as of last year …)
o DOM is OO and Fortran is not
• A C binding exists (Gdome2)
• Gdome2 was installed, with very hard work, on a mainframe platform (it was conceived for Linux)
• We are currently converting it to Fortran, by adopting the DOM recommendations (simplified …)
Why Fortran
GOOD
• Users don't need to learn a new language
• Homogeneous environment
BAD
• Tricky: need an external library (f77xml) built on top of gdome2
• Porting problems for gdome2/libxml2 may arise
F77xml library
• Still in development
o v0.4 is out (experimental, with limited features)
o v1.0 upcoming, API changed to be nearly DOM2 compliant
• Written in C on top of gdome2 (http://gdome2.cs.unibo.it/index.html)
• Designed for interfacing to F77 (also F90 soon)
• Reduced namespace pollution
Cons:
• F77 syntax is difficult (DOM2 + tricks)
• F90 syntax is simpler
• A pre-processor will convert F90 syntax to F77
http://freshmeat.net/projects/f77xml
F77xml library - V1.0 example
Gdome2 (C)
GdomeNode* gdome_el_firstChild (GdomeElement *self, GdomeException *exc);
F90
Call f77xml_el_firstChild(nodeCode, elemCode, exc)
First position: return value
nodeCode, elemCode, exc mapped to INTEGER
F77
Func='el_firstChild'
Call xp3t1(nodeCode,func,elemCode,exc)
Multiplexer function:
x:
p3: 3 parameters (+ function name)
t1: type 1 parameter schema (code/code/error)
Why Python
GOOD
• Very easy object-oriented language
• Works well with strings
• Simple and efficient DOM interface for XML
• Present in almost all UNIX/Linux distributions
BAD
• Users do need to learn a new language
• Maybe less powerful than Perl
• Usually not used by chemists
Python Wrapper
At present, a prototype works with the MOLPRO-FCI chain:
• It takes information from the XML repository
• It writes the proper MOLPRO and FCI input
• It starts the two programs
To run with a different XML file, users should only need to specify the file name and some simple parameters (orbital guess for FCI)
Python or not
• Python is very simple to learn and works very efficiently with XML
• Scripts written in Python (at least for prototypes) are quite clear, linear and easy to maintain or upgrade
• The possibility of a GUI could make our project much more user-friendly
What we have done …
Single platform: IBM SP4
Two code chains:
• MolPro to FCI
• MolPro to CasDI
[Diagram: workflow running from “Start here” to “Stop here”. Labels: QCML Repository, HDF5 Repository, IN-wrapper, MolPro IN-file, MolPro, OUT-wrapper, FCIDUMP, IN-wrapper, FCI IN-file, bin file for FCI, FCI]
In conclusion …
Two important hints on data …
1. Use some XML dialect for describing simple structured data
2. Use HDF5 for storing large arrays and binary data
Open issues:
• Need for a good and easy API to XML & HDF
• How to manage the workflow
• How to manage the grid connection